MyArxiv
Computation and Language
☆ RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)
Effective collaboration of dual-arm robots and their tool use capabilities are increasingly important areas in the advancement of robotics. These skills play a significant role in expanding robots' ability to operate in diverse real-world environments. However, progress is impeded by the scarcity of specialized training data. This paper introduces RoboTwin, a novel benchmark dataset combining real-world teleoperated data with synthetic data from digital twins, designed for dual-arm robotic scenarios. Using the COBOT Magic platform, we have collected diverse data on tool usage and human-robot interaction. We present a innovative approach to creating digital twins using AI-generated content, transforming 2D images into detailed 3D models. Furthermore, we utilize large language models to generate expert-level training data and task-specific pose sequences oriented toward functionality. Our key contributions are: 1) the RoboTwin benchmark dataset, 2) an efficient real-to-simulation pipeline, and 3) the use of language models for automatic expert-level data generation. These advancements are designed to address the shortage of robotic training data, potentially accelerating the development of more capable and versatile robotic systems for a wide range of real-world applications. The project page is available at https://robotwin-benchmark.github.io/early-version/
comment: Project page: https://robotwin-benchmark.github.io/early-version/
☆ Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling
Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20$\times$ speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, even with the 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs' generation results in the previous literature.
comment: 40 pages
☆ LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.
☆ LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as \textit{degraded performance with more images} and \textit{high computational costs}. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model \textbf{LongLLaVA}~(\textbf{Long}-Context \textbf{L}arge \textbf{L}anguage \textbf{a}nd \textbf{V}ision \textbf{A}ssistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
comment: 19 pages, 7 figures, 6 tables
☆ Configurable Foundation Models: Building LLMs from a Modular Perspective
Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.
☆ Historical German Text Normalization Using Type- and Token-Based Language Modeling
Historic variations of spelling poses a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, usually an automatic orthographic normalization of the historical source material is pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge due to difficulties for models to generalize, and the lack of extensive high-quality parallel data.
comment: 27 pages, 3 figures
☆ R2GQA: Retriever-Reader-Generator Question Answering System to Support Students Understanding Legal Regulations in Higher Education
In this article, we propose the R2GQA system, a Retriever-Reader-Generator Question Answering system, consisting of three main components: Document Retriever, Machine Reader, and Answer Generator. The Retriever module employs advanced information retrieval techniques to extract the context of articles from a dataset of legal regulation documents. The Machine Reader module utilizes state-of-the-art natural language understanding algorithms to comprehend the retrieved documents and extract answers. Finally, the Generator module synthesizes the extracted answers into concise and informative responses to questions of students regarding legal regulations. Furthermore, we built the ViRHE4QA dataset in the domain of university training regulations, comprising 9,758 question-answer pairs with a rigorous construction process. This is the first Vietnamese dataset in the higher regulations domain with various types of answers, both extractive and abstractive. In addition, the R2GQA system is the first system to offer abstractive answers in Vietnamese. This paper discusses the design and implementation of each module within the R2GQA system on the ViRHE4QA dataset, highlighting their functionalities and interactions. Furthermore, we present experimental results demonstrating the effectiveness and utility of the proposed system in supporting the comprehension of students of legal regulations in higher education settings. In general, the R2GQA system and the ViRHE4QA dataset promise to contribute significantly to related research and help students navigate complex legal documents and regulations, empowering them to make informed decisions and adhere to institutional policies effectively. Our dataset is available for research purposes.
☆ Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models
This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction statements," categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.
☆ CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models
Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or opinions, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.
☆ MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
☆ Towards a Unified View of Preference Learning for Large Language Models: A Survey
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.
comment: Initial Commit, 21 pages
☆ A Comparative Study of Pre-training and Self-training
Pre-training and self-training are two approaches to semi-supervised learning. The comparison between pre-training and self-training has been explored. However, the previous works led to confusing findings: self-training outperforms pre-training experienced on some tasks in computer vision, and contrarily, pre-training outperforms self-training experienced on some tasks in natural language processing, under certain conditions of incomparable settings. We propose, comparatively and exhaustively, an ensemble method to empirical study all feasible training paradigms combining pre-training, self-training, and fine-tuning within consistent foundational settings comparable to data augmentation. We conduct experiments on six datasets, four data augmentation, and imbalanced data for sentiment analysis and natural language inference tasks. Our findings confirm that the pre-training and fine-tuning paradigm yields the best overall performances. Moreover, self-training offers no additional benefits when combined with semi-supervised pre-training.
comment: 19 pages, 2 figures, 9 tables
☆ Pooling And Attention: What Are Effective Designs For LLm-Based Embedding Models?
The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
comment: https://github.com/yixuantt/PoolingAndAttn
Pre-training data selection for biomedical domain adaptation using journal impact metrics
Domain adaptation is a widely used method in natural language processing (NLP) to improve the performance of a language model within a specific domain. This method is particularly common in the biomedical domain, which sees regular publication of numerous scientific articles. PubMed, a significant corpus of text, is frequently used in the biomedical domain. The primary objective of this study is to explore whether refining a pre-training dataset using specific quality metrics for scientific papers can enhance the performance of the resulting model. To accomplish this, we employ two straightforward journal impact metrics and conduct experiments by continually pre-training BERT on various subsets of the complete PubMed training set, we then evaluate the resulting models on biomedical language understanding tasks from the BLURB benchmark. Our results show that pruning using journal impact metrics is not efficient. But we also show that pre-training using fewer abstracts (but with the same number of training steps) does not necessarily decrease the resulting model's performance.
☆ Alignment-Aware Model Extraction Attacks on Large Language Models
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing research attention lately. Existing attack methods on LLMs inherit the extraction strategies from those designed for deep neural networks (DNNs) yet neglect the inconsistency of training tasks between MEA and LLMs' alignments. As such, they result in poor attack performances. To tackle this issue, we present Locality Reinforced Distillation (LoRD), a novel model extraction attack algorithm specifically for LLMs. In particular, we design a policy-gradient-style training task, which utilizes victim models' responses as a signal to guide the crafting of preference for the local model. Theoretical analysis has shown that i) LoRD's convergence procedure in MEAs is consistent with the alignments of LLMs, and ii) LoRD can reduce query complexity while mitigating watermark protection through exploration-based stealing. Extensive experiments on domain-specific extractions demonstrate the superiority of our method by examining the extraction of various state-of-the-art commercial LLMs.
comment: Source code: https://github.com/liangzid/alignmentExtraction
☆ A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations
Machine translation in low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy, impeding the performance of machine translation models. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Our methodology leverages a multilingual SBERT model to filter out problematic translations in the training data. Specifically, we employ an IndicSBERT similarity model to assess the semantic equivalence between original and translated sentences, allowing us to retain linguistically correct translations while discarding instances with substantial deviations. The results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT. This illustrates how cross-lingual sentence representations can reduce errors in machine translation scenarios with limited resources. By integrating multilingual sentence BERT models into the translation pipeline, this research contributes to advancing machine translation techniques in low-resource environments. The proposed method not only addresses the challenges in English-Marathi language pairs but also provides a valuable framework for enhancing translation quality in other low-resource language translation tasks.
comment: Accepted at I2CT 2024
☆ Detecting Calls to Action in Multimodal Content: Analysis of the 2021 German Federal Election Campaign on Instagram
This study investigates the automated classification of Calls to Action (CTAs) within the 2021 German Instagram election campaign to advance the understanding of mobilization in social media contexts. We analyzed over 2,208 Instagram stories and 712 posts using fine-tuned BERT models and OpenAI's GPT-4 models. The fine-tuned BERT model incorporating synthetic training data achieved a macro F1 score of 0.93, demonstrating a robust classification performance. Our analysis revealed that 49.58% of Instagram posts and 10.64% of stories contained CTAs, highlighting significant differences in mobilization strategies between these content types. Additionally, we found that FDP and the Greens had the highest prevalence of CTAs in posts, whereas CDU and CSU led in story CTAs.
comment: Accepted Archival Paper for the CPSS Workshop at KONVENS 2024. Camera Ready Submission
☆ Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.
☆ Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus
This article investigates how translation memories (TM) can be created by translators or other language professionals in order to compile domain-specific parallel corpora , which can then be used in different scenarios, such as machine translation training and fine-tuning, TM leveraging, and/or large language model fine-tuning. The article introduces a semi-automatic TM preparation methodology leveraging primarily translation tools used by translators in favor of data quality and control by the translators. This semi-automatic methodology is then used to build a cardiology-based Turkish -> English corpus from bilingual abstracts of Turkish cardiology journals. The resulting corpus called TRENCARD Corpus has approximately 800,000 source words and 50,000 sentences. Using this methodology, translators can build their custom TMs in a reasonable time and use them in their bilingual data requiring tasks.
☆ OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation
This paper presents the experiments and results for the CheckThat! Lab at CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE). The primary objective of this task was to generate adversarial examples in five problem domains in order to evaluate the robustness of widely used text classification methods (fine-tuned BERT, BiLSTM, and RoBERTa) when applied to credibility assessment issues. This study explores the application of ensemble learning to enhance adversarial attacks on natural language processing (NLP) models. We systematically tested and refined several adversarial attack methods, including BERT-Attack, Genetic algorithms, TextFooler, and CLARE, on five datasets across various misinformation tasks. By developing modified versions of BERT-Attack and hybrid methods, we achieved significant improvements in attack effectiveness. Our results demonstrate the potential of modification and combining multiple methods to create more sophisticated and effective adversarial attack strategies, contributing to the development of more robust and secure systems.
comment: CLEF 2024 - Conference and Labs of the Evaluation Forum
☆ A Survey on Emergent Language
The field of emergent language represents a novel area of research within the domain of artificial intelligence, particularly within the context of multi-agent reinforcement learning. Although the concept of studying language emergence is not new, early approaches were primarily concerned with explaining human language formation, with little consideration given to its potential utility for artificial agents. In contrast, studies based on reinforcement learning aim to develop communicative capabilities in agents that are comparable to or even superior to human language. Thus, they extend beyond the learned statistical representations that are common in natural language processing research. This gives rise to a number of fundamental questions, from the prerequisites for language emergence to the criteria for measuring its success. This paper addresses these questions by providing a comprehensive review of 181 scientific publications on emergent language in artificial intelligence. Its objective is to serve as a reference for researchers interested in or proficient in the field. Consequently, the main contributions are the definition and overview of the prevailing terminology, the analysis of existing evaluation methods and metrics, and the description of the identified research gaps.
☆ PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation
The ability of large language models (LLMs) to interpret visual representations of data is crucial for advancing their application in data analysis and decision-making processes. This paper presents a novel synthetic dataset designed to evaluate the proficiency of LLMs in interpreting various forms of data visualizations, including plots like time series, histograms, violins, boxplots, and clusters. Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios. We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models like ChatGPT or Gemini, assessing their understanding and interpretative accuracy. To ensure data integrity, our benchmark dataset is generated automatically, making it entirely new and free from prior exposure to the models being tested. This strategy allows us to evaluate the models' ability to truly interpret and understand the data, eliminating possibility of pre-learned responses, and allowing for an unbiased evaluation of the models' capabilities. We also introduce quantitative metrics to assess the performance of the models, providing a robust and comprehensive evaluation tool. Benchmarking several state-of-the-art LLMs with this dataset reveals varying degrees of success, highlighting specific strengths and weaknesses in interpreting diverse types of visual data. The results provide valuable insights into the current capabilities of LLMs and identify key areas for improvement. This work establishes a foundational benchmark for future research and development aimed at enhancing the visual interpretative abilities of language models. In the future, improved LLMs with robust visual interpretation skills can significantly aid in automated data analysis, scientific research, educational tools, and business intelligence applications.
☆ An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.
comment: Accepted in the IEEE Soken Language Technology Workshop 2024
☆ More is More: Addition Bias in Large Language Models
In this paper, we investigate the presence of additive bias in Large Language Models (LLMs), drawing a parallel to the cognitive bias observed in humans where individuals tend to favor additive over subtractive changes. Using a series of controlled experiments, we tested various LLMs, including GPT-3.5 Turbo, Claude 3.5 Sonnet, Mistral, Math$\Sigma$tral, and Llama 3.1, on tasks designed to measure their propensity for additive versus subtractive modifications. Our findings demonstrate a significant preference for additive changes across all tested models. For example, in a palindrome creation task, Llama 3.1 favored adding letters 97.85% of the time over removing them. Similarly, in a Lego tower balancing task, GPT-3.5 Turbo chose to add a brick 76.38% of the time rather than remove one. In a text summarization task, Mistral 7B produced longer summaries in 59.40% to 75.10% of cases when asked to improve its own or others' writing. These results indicate that, similar to humans, LLMs exhibit a marked additive bias, which might have implications when LLMs are used on a large scale. Addittive bias might increase resource use and environmental impact, leading to higher economic costs due to overconsumption and waste. This bias should be considered in the development and application of LLMs to ensure balanced and efficient problem-solving approaches.
comment: 25 pages, 8 figures
☆ Language is Scary when Over-Analyzed: Unpacking Implied Misogynistic Reasoning with Argumentation Theory-Driven Prompts
We propose misogyny detection as an Argumentative Reasoning task and we investigate the capacity of large language models (LLMs) to understand the implicit reasoning used to convey misogyny in both Italian and English. The central aim is to generate the missing reasoning link between a message and the implied meanings encoding the misogyny. Our study uses argumentation theory as a foundation to form a collection of prompts in both zero-shot and few-shot settings. These prompts integrate different techniques, including chain-of-thought reasoning and augmented knowledge. Our findings show that LLMs fall short on reasoning capabilities about misogynistic comments and that they mostly rely on their implicit knowledge derived from internalized common stereotypes about women to generate implied assumptions, rather than on inductive reasoning.
☆ Word and Phrase Features in Graph Convolutional Network for Automatic Question Classification
Effective question classification is crucial for AI-driven educational tools, enabling adaptive learning systems to categorize questions by skill area, difficulty level, and competence. This classification not only supports educational diagnostics and analytics but also enhances complex tasks like information retrieval and question answering by associating questions with relevant categories. Traditional methods, often based on word embeddings and conventional classifiers, struggle to capture the nuanced relationships in natural language, leading to suboptimal performance. To address this, we propose a novel approach leveraging graph convolutional networks (GCNs), named Phrase Question-Graph Convolutional Network (PQ-GCN) to better model the inherent structure of questions. By representing questions as graphs -- where nodes signify words or phrases and edges denote syntactic or semantic relationships -- our method allows GCNs to learn from the interconnected nature of language more effectively. Additionally, we explore the incorporation of phrase-based features to enhance classification accuracy, especially in low-resource settings. Our findings demonstrate that GCNs, augmented with these features, offer a promising solution for more accurate and context-aware question classification, bridging the gap between graph neural network research and practical educational applications.
☆ A Comparative Study on Large Language Models for Log Parsing
Background: Log messages provide valuable information about the status of software systems. This information is provided in an unstructured fashion and automated approaches are applied to extract relevant parameters. To ease this process, log parsing can be applied, which transforms log messages into structured log templates. Recent advances in language models have led to several studies that apply ChatGPT to the task of log parsing with promising results. However, the performance of other state-of-the-art large language models (LLMs) on the log parsing task remains unclear. Aims: In this study, we investigate the current capability of state-of-the-art LLMs to perform log parsing. Method: We select six recent LLMs, including both paid proprietary (GPT-3.5, Claude 2.1) and four free-to-use open models, and compare their performance on system logs obtained from a selection of mature open-source projects. We design two different prompting approaches and apply the LLMs on 1, 354 log templates across 16 different projects. We evaluate their effectiveness, in the number of correctly identified templates, and the syntactic similarity between the generated templates and the ground truth. Results: We found that free-to-use models are able to compete with paid models, with CodeLlama extracting 10% more log templates correctly than GPT-3.5. Moreover, we provide qualitative insights into the usability of language models (e.g., how easy it is to use their responses). Conclusions: Our results reveal that some of the smaller, free-to-use LLMs can considerably assist log parsing compared to their paid proprietary competitors, especially code-specialized models.
comment: Accepted for publication in the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '24)
☆ DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels
With the rapid advancement of Large Language Models (LLMs), long-context information understanding and processing have become a hot topic in academia and industry. However, benchmarks for evaluating the ability of LLMs to handle long-context information do not seem to have kept pace with the development of LLMs. Despite the emergence of various long-context evaluation benchmarks, the types of capability assessed are still limited, without new capability dimensions. In this paper, we introduce DetectiveQA, a narrative reasoning benchmark featured with an average context length of over 100K tokens. DetectiveQA focuses on evaluating the long-context reasoning ability of LLMs, which not only requires a full understanding of context but also requires extracting important evidences from the context and reasoning according to extracted evidences to answer the given questions. This is a new dimension of capability evaluation, which is more in line with the current intelligence level of LLMs. We use detective novels as data sources, which naturally have various reasoning elements. Finally, we manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions. We evaluate many long-context LLMs on DetectiveQA, including commercial and open-sourced models, and the results indicate that existing long-context LLMs still require significant advancements to effectively process true long-context dependency questions.
☆ What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations EMNLP 2024
This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts. We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer, and their unintended consequences on performance metrics. Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially inflated performance metrics for Indic languages. We conclude by proposing a shift towards developing normalization routines that leverage native linguistic expertise, ensuring more robust and accurate evaluations of multilingual ASR models.
comment: Sumbitted to EMNLP 2024
☆ Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to be effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ the reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully correct the reward code with only one feedback for each requirement, effectively preventing irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables the acquisition of different reward functions within the Pareto solution set without weight search. Even in the case where a weight is 100 times off, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts utilizing GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.
☆ Abstractive Text Summarization: State of the Art, Challenges, and Improvements
Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations and charts out future improvements - providing researchers an extensive overview to advance abstractive summarization research. We provide vital comparison tables across techniques categorized - offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.
comment: 9 Tables, 7 Figures
☆ Determination of language families using deep learning
We use a c-GAN (convolutional generative adversarial) neural network to analyze transliterated text fragments of extant, dead comprehensible, and one dead non-deciphered (Cypro-Minoan) language to establish linguistic affinities. The paper is agnostic with respect to translation and/or deciphering. However, there is hope that the proposed approach can be useful for decipherment with more sophisticated neural network techniques.
comment: First draft. Comments are welcome
☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.
comment: 10 pages, 1 figure
☆ STAB: Speech Tokenizer Assessment Benchmark
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text, thus enabling the use of speech as an input to the widely successful large language models (LLMs). Currently, while several speech tokenizers have been proposed, there is ambiguity regarding the properties that are desired from a tokenizer for specific downstream tasks and its overall generalizability. Evaluating the performance of tokenizers across different downstream tasks is a computationally intensive effort that poses challenges for scalability. To circumvent this requirement, we present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively and shed light on their inherent characteristics. This framework provides a deeper understanding of the underlying mechanisms of speech tokenization, thereby offering a valuable resource for expediting the advancement of future tokenizer models and enabling comparative analysis using a standardized benchmark. We evaluate the STAB metrics and correlate this with downstream task performance across a range of speech tasks and tokenizer choices.
comment: 5 pages
☆ How Privacy-Savvy Are Large Language Models? A Case Study on Compliance and Privacy Technical Review
The recent advances in large language models (LLMs) have significantly expanded their applications across various fields such as language generation, summarization, and complex question answering. However, their application to privacy compliance and technical privacy reviews remains under-explored, raising critical concerns about their ability to adhere to global privacy standards and protect sensitive user data. This paper seeks to address this gap by providing a comprehensive case study evaluating LLMs' performance in privacy-related tasks such as privacy information extraction (PIE), legal and regulatory key point detection (KPD), and question answering (QA) with respect to privacy policies and data protection regulations. We introduce a Privacy Technical Review (PTR) framework, highlighting its role in mitigating privacy risks during the software development life-cycle. Through an empirical assessment, we investigate the capacity of several prominent LLMs, including BERT, GPT-3.5, GPT-4, and custom models, in executing privacy compliance checks and technical privacy reviews. Our experiments benchmark the models across multiple dimensions, focusing on their precision, recall, and F1-scores in extracting privacy-sensitive information and detecting key regulatory compliance points. While LLMs show promise in automating privacy reviews and identifying regulatory discrepancies, significant gaps persist in their ability to fully comply with evolving legal standards. We provide actionable recommendations for enhancing LLMs' capabilities in privacy compliance, emphasizing the need for robust model improvements and better integration with legal and regulatory requirements. This study underscores the growing importance of developing privacy-aware LLMs that can both support businesses in compliance efforts and safeguard user privacy rights.
comment: 8 pages, 4 figures
☆ Do Large Language Models Possess Sensitive to Sentiment?
Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in text modal. As the integration of LLMs into diverse applications is on the rise, it becomes highly critical to comprehend their sensitivity to emotional tone, as it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to sentiments like positive, negative, and neutral emotions. The models' outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our discoveries indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, emphasizing the requirement for further enhancements in their training processes to better capture subtle emotional cues. Take an example in our findings, in some cases, the models might wrongly classify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need to be refined. Another aspect is that different LLMs might perform differently on the same set of data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.
comment: 10 pages, 2 figures
☆ Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering
The retrieval augmented generation (RAG) framework addresses an ambiguity in user queries in QA systems by retrieving passages that cover all plausible interpretations and generating comprehensive responses based on the passages. However, our preliminary studies reveal that a single retrieval process often suffers from low quality results, as the retrieved passages frequently fail to capture all plausible interpretations. Although the iterative RAG approach has been proposed to address this problem, it comes at the cost of significantly reduced efficiency. To address these issues, we propose the diversify-verify-adapt (DIVA) framework. DIVA first diversifies the retrieved passages to encompass diverse interpretations. Subsequently, DIVA verifies the quality of the passages and adapts the most suitable approach tailored to their quality. This approach improves the QA systems accuracy and robustness by handling low quality retrieval issue in ambiguous questions, while enhancing efficiency.
☆ NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
♻ ☆ LADDER: Language Driven Slice Discovery and Error Rectification
Error slice discovery associates structured patterns with model errors. Existing methods discover error slices by clustering the error-prone samples with similar patterns or assigning discrete attributes to each sample for post-hoc analysis. While these methods aim for interpretability and easier mitigation through reweighting or rebalancing, they may not capture the full complexity of error patterns due to incomplete or missing attributes. Contrary to the existing approach, this paper utilizes the reasoning capabilities of the Large Language Model (LLM) to analyze complex error patterns and generate testable hypotheses. This paper proposes LADDER: Language Driven slice Discovery and Error Rectification. It first projects the model's representation into a language-aligned feature space (eg CLIP) to preserve semantics in the original model feature space. This ensures the accurate retrieval of sentences that highlight the model's errors. Next, the LLM utilizes the sentences and generates hypotheses to discover error slices. Finally, we mitigate the error by fine-tuning the classification head by creating a group-balanced dataset using the hypotheses. Our entire method does not require any attribute annotation, either explicitly or through external tagging models. We validate our method with \textbf{five} image classification datasets. The code is available (https://github.com/batmanlab/Ladder).
♻ ☆ The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem
Large language models (LLMs) are useful tools with the capacity for performing specific types of knowledge work at an effective scale. However, LLM deployments in high-risk and safety-critical domains pose unique challenges, notably the issue of ``hallucination,'' where LLMs can generate fabricated information. This is particularly concerning in settings such as drug safety, where inaccuracies could lead to patient harm. To mitigate these risks, we have developed and demonstrated a proof of concept suite of guardrails specifically designed to mitigate certain types of hallucinations and errors for drug safety, and potentially applicable to other medical safety-critical contexts. These guardrails include mechanisms to detect anomalous documents to prevent the ingestion of inappropriate data, identify incorrect drug names or adverse event terms, and convey uncertainty in generated content. We integrated these guardrails with an LLM fine-tuned for a text-to-text task, which involves converting both structured and unstructured data within adverse event reports into natural language. This method was applied to translate individual case safety reports, demonstrating effective application in a pharmacovigilance processing task. Our guardrail framework offers a set of tools with broad applicability across various domains, ensuring LLMs can be safely used in high-risk situations by eliminating the occurrence of key errors, including the generation of incorrect pharmacovigilance-related terms, thus adhering to stringent regulatory and quality standards in medical safety-critical environments.
comment: 27 pages, 6 figures, 4 tables and supplementary material provided
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
♻ ☆ LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resource over 85% compared to full sequence training. Furthermore, LongRecipe also preserves the original LLM's capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80G memory. Our code is released at https://github.com/zhiyuanhubj/LongRecipe.
comment: Work in Progress
♻ ☆ Revisiting Character-level Adversarial Attacks for Language Models ICML 2024
Adversarial attacks in Natural Language Processing apply perturbations in the character or token levels. Token-level attacks, gaining prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention as they cannot easily adopt popular gradient-based methods, and are thought to be easy to defend. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR in 4.84% points and the USE similarity in 8% points with respect to the previous art. Our implementation is available in https://github.com/LIONS-EPFL/Charmer.
comment: Accepted in ICML 2024
♻ ☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
♻ ☆ AI-generated text boundary detection with RoFT
Due to the rapid development of large language models, people increasingly often encounter texts that may start as written by a human but continue as machine-generated. Detecting the boundary between human-written and machine-generated parts of such texts is a challenging problem that has not received much attention in literature. We attempt to bridge this gap and examine several ways to adapt state of the art artificial text detection classifiers to the boundary detection setting. We push all detectors to their limits, using the Real or Fake text benchmark that contains short texts on several topics and includes generations of various language models. We use this diversity to deeply examine the robustness of all detectors in cross-domain and cross-model settings to provide baselines and insights for future research. In particular, we find that perplexity-based approaches to boundary detection tend to be more robust to peculiarities of domain-specific data than supervised fine-tuning of the RoBERTa model; we also find which features of the text confuse boundary detection algorithms and negatively influence their performance in cross-domain settings.
comment: Our official repository: https://github.com/SilverSolver/ai_boundary_detection
♻ ☆ Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation
Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these tools to improve them further. Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content. In the present work, we have identified a fundamental limitation related to the image generation ability of LLMs, and termed it The NO Syndrome. This negation blindness refers to LLMs inability to correctly comprehend NO related natural language prompts to generate the desired images. Interestingly, all tested LLMs including GPT-4, Gemini, and Copilot were found to be suffering from this syndrome. To demonstrate the generalization of this limitation, we carried out simulation experiments and conducted entropy-based and benchmark statistical analysis tests on various LLMs in multiple languages, including English, Hindi, and French. We conclude that the NO syndrome is a significant flaw in current LLMs that needs to be addressed. A related finding of this study showed a consistent discrepancy between image and textual responses as a result of this NO syndrome. We posit that the introduction of a negation context-aware reinforcement learning based feedback loop between the LLMs textual response and generated image could help ensure the generated text is based on both the LLMs correct contextual understanding of the negation query and the generated visual output.
comment: 15 pages, 7 figures
♻ ☆ Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors' simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult.
♻ ☆ A Causal Explainable Guardrails for Large Language Models
Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for steering LLMs toward desired attributes often assume unbiased representations and rely solely on steering prompts. However, the representations learned from pre-training can introduce semantic biases that influence the steering process, leading to suboptimal results. We propose LLMGuardrail, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. LLMGuardrail systematically identifies and blocks the confounding effects of biases, enabling the extraction of unbiased steering representations. Additionally, it includes an explainable component that provides insights into the alignment between the generated output and the desired direction. Experiments demonstrate LLMGuardrail's effectiveness in steering LLMs toward desired attributes while mitigating biases. Our work contributes to the development of safe and reliable LLMs that align with desired attributes.
comment: 16 pages
♻ ☆ Parallel Speculative Decoding with Adaptive Draft Length
Speculative decoding (SD), where an extra draft model is employed to provide multiple \textit{draft} tokens first and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem, i.e., the target model gets stuck when the draft model is \textit{guessing} tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model, and is exacerbated due to the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely \textbf{P}arallel sp\textbf{E}culative decoding with \textbf{A}daptive d\textbf{R}aft \textbf{L}ength (PEARL). Specifically, PEARL proposes \textit{pre-verify} to verify the first draft token in advance during the drafting phase, and \textit{post-verify} to generate more draft tokens during the verification phase. PEARL parallels the drafting phase and the verification phase via applying the two strategies, and achieves adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Moreover, we theoretically demonstrate that the mean accepted tokens of PEARL is more than existing \textit{draft-then-verify} works. Experiments on various text generation benchmarks demonstrate the effectiveness of our \name, leading to a superior speedup performance up to \textbf{3.79$\times$} and \textbf{1.52$\times$}, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
♻ ☆ HIRO: Hierarchical Information Retrieval Optimization
Retrieval-Augmented Generation (RAG) has revolutionized natural language processing by dynamically integrating external knowledge into Large Language Models (LLMs), addressing their limitation of static training datasets. Recent implementations of RAG leverage hierarchical data structures, which organize documents at various levels of summarization and information density. This complexity, however, can cause LLMs to "choke" on information overload, necessitating more sophisticated querying mechanisms. In this context, we introduce Hierarchical Information Retrieval Optimization (HIRO), a novel querying approach that employs a Depth-First Search (DFS)-based recursive similarity score calculation and branch pruning. This method uniquely minimizes the context delivered to the LLM without informational loss, effectively managing the challenge of excessive data. HIRO's refined approach is validated by a 10.85% improvement in performance on the NarrativeQA dataset.
♻ ☆ What Formal Languages Can Transformers Express? A Survey
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.
comment: One minor correction in {\S}5.1
♻ ☆ Large Language Models for Information Retrieval: A Survey
As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions, such as search agents, within this expanding field.
comment: updated to version 3
♻ ☆ Towards a Universal Method for Meaningful Signal Detection
It is known that human speech and certain animal vocalizations can convey meaningful content because we can decipher the content that a given utterance does convey. This paper explores an alternative approach to determining whether a signal is meaningful, one that analyzes only the signal itself and is independent of what the conveyed meaning might be. We devise a method that takes a waveform as input and outputs a score indicating its degree of `meaningfulness`. We cluster contiguous portions of the input to minimize the total description length, and then take the length of the code of the assigned cluster labels as meaningfulness score. We evaluate our method empirically, against several baselines, and show that it is the only one to give a high score to human speech in various languages and with various speakers, a moderate score to animal vocalizations from birds and orcas, and a low score to ambient noise from various sources.
♻ ☆ Open Implementation and Study of BEST-RQ for Speech Processing ICASSP 2024
Self-Supervised Learning (SSL) has proven to be useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ), is an SSL method that has shown great performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods, such as wav2vec 2.0. Despite BEST-RQ's great performance, details are lacking in the original paper, such as the amount of GPU/TPU hours used in pre-training, and there is no official easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on other downstream tasks aside from ASR and speech translation. In this work, we describe a re-implementation of a Random-projection quantizer and perform a preliminary study with a comparison to wav2vec 2.0 on four downstream tasks. We discuss the details and differences of our implementation. We show that a random projection quantizer can achieve similar downstream performance as wav2vec 2.0 while decreasing training time by over a factor of two.
comment: Accepted in IEEE ICASSP 2024 workshop on Self-supervision in Audio, Speech and Beyond (SASB 2024)
♻ ☆ Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are one of the most prominent approaches in this direction, but risk losing the semantics of the context caused by intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique where its key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions, positives, and negative pairs where positives are sentences relevant to the question, while negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: https://github.com/Workday/cpc.
♻ ☆ CADGE: Context-Aware Dialogue Generation Enhanced with Graph-Structured Knowledge Aggregation
Commonsense knowledge is crucial to many natural language processing tasks. Existing works usually incorporate graph knowledge with conventional graph neural networks (GNNs), leading to the text and graph knowledge encoding processes being separated in a serial pipeline. We argue that these separate representation learning stages may be suboptimal for neural networks to learn the overall context contained in both types of input knowledge. In this paper, we propose a novel context-aware graph-attention model (Context-aware GAT), which can effectively incorporate global features of relevant knowledge graphs based on a context-enhanced knowledge aggregation process. Specifically, our framework leverages a novel representation learning approach to process heterogeneous features - combining flattened graph knowledge with text. To the best of our knowledge, this is the first attempt at hierarchically applying graph knowledge aggregation on a connected subgraph in addition to contextual information to support commonsense dialogue generation. This framework shows superior performance compared to conventional GNN-based language frameworks. Both automatic and human evaluation demonstrates that our proposed model has significant performance uplifts over state-of-the-art baselines.
comment: Accepted by INLG 2024
♻ ☆ Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It's cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
comment: Journal Paper, 14 pages
♻ ☆ A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?
Modern Artificial Intelligence applications show great potential for language-related tasks that rely on next-word prediction. The current generation of Large Language Models (LLMs) have been linked to claims about human-like linguistic performance and their applications are hailed both as a step towards artificial general intelligence and as a major advance in understanding the cognitive, and even neural basis of human language. To assess these claims, first we analyze the contribution of LLMs as theoretically informative representations of a target cognitive system vs. atheoretical mechanistic tools. Second, we evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing, which requires grounding in previous expectations and past world experience. We hypothesize that since models lack grounded cognition, they cannot take advantage of these features and instead solely rely on fixed associations between represented words and word vectors. To assess this, we designed and ran a novel 'leet task' (l33t t4sk), which requires decoding sentences in which letters are systematically replaced by numbers. The results suggest that humans excel in this task whereas models struggle, confirming our hypothesis. We interpret the results by identifying the key abilities that are still missing from the current state of development of these models, which require solutions that go beyond increased system scaling.
♻ ☆ Exploring Interpretability of Independent Components of Word Embeddings with Automated Word Intruder Test LREC
Independent Component Analysis (ICA) is an algorithm originally developed for finding separate sources in a mixed signal, such as a recording of multiple people in the same room speaking at the same time. Unlike Principal Component Analysis (PCA), ICA permits the representation of a word as an unstructured set of features, without any particular feature being deemed more significant than the others. In this paper, we used ICA to analyze word embeddings. We have found that ICA can be used to find semantic features of the words, and these features can easily be combined to search for words that satisfy the combination. We show that most of the independent components represent such features. To quantify the interpretability of the components, we use the word intruder test, performed both by humans and by large language models. We propose to use the automated version of the word intruder test as a fast and inexpensive way of quantifying vector interpretability without the need for human effort.
comment: Presented at LREC-COLING 2024, cite this version please: https://aclanthology.org/2024.lrec-main.605/
♻ ☆ Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models
Background and Aims: This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology. Methods: We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images to systematically assess the impact of model configurations and parameters and prompt engineering strategies utilizing GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs (versions), including GPT (3.5, 4, 4o, 4omini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces (web and API), computing environments (cloud and local), and model precisions (with and without quantization). Finally, we assessed accuracy using a semiautomated pipeline. Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved the highest accuracy, outperforming the top open-source models: Llama3.1-405b (64%), Llama3.1-70b (58.3%), and Mixtral-8x7b (54.3%). Among the quantized open-source models, the 6-bit quantized Phi3-14b (48.7%) performed best. The scores of the quantized models were comparable to those of the full-precision models Llama2-7b, Llama2--13b, and Gemma2-9b. Notably, VLM performance on image-containing questions did not improve when the images were provided and worsened when LLM-generated captions were provided. In contrast, a 10% increase in accuracy was observed when images were accompanied by human-crafted image descriptions. Conclusion: In conclusion, while LLMs exhibit robust zero-shot performance in medical reasoning, the integration of visual data remains a challenge for VLMs. Effective deployment involves carefully determining optimal model configurations, encouraging users to consider either the high performance of proprietary models or the flexible adaptability of open-source models.
comment: Manuscript Pages: 34, Figures: 7, Tables: 2, Supplementary File Pages: 35, Data Transparency Statement: Code is available at: https://github.com/Sdamirsa/LLM-VLM-in-Gastroenterology . Study data from American College of Gastroenterology (ACG) are restricted and available upon request with ACG permission. Correction: updated abstract considering Llama3.1 results
♻ ☆ Towards Measuring and Modeling "Culture" in LLMs: A Survey
We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly define "culture, which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture". We call these aspects the proxies of culture, and organize them across two dimensions of demographic and semantic proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of ``culture,'' such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness of probing techniques and situated studies on the impact of cultural mis- and under-representation in LLM-based applications.
♻ ☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever EMNLP
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks,
comment: 8 pages, references at pp7,8; EMNLP workshop submission
♻ ☆ An Empirical Study on Information Extraction using Large Language Models
Human-like large language models (LLMs), especially the most powerful and popular ones in OpenAI's GPT family, have proven to be very helpful for many natural language processing (NLP) related tasks. Therefore, various attempts have been made to apply LLMs to information extraction (IE), which is a fundamental NLP task that involves extracting information from unstructured plain text. To demonstrate the latest representative progress in LLMs' information extraction ability, we assess the information extraction ability of GPT-4 (the latest version of GPT at the time of writing this paper) from four perspectives: Performance, Evaluation Criteria, Robustness, and Error Types. Our results suggest a visible performance gap between GPT-4 and state-of-the-art (SOTA) IE methods. To alleviate this problem, considering the LLMs' human-like characteristics, we propose and analyze the effects of a series of simple prompt-based methods, which can be generalized to other LLMs and NLP tasks. Rich experiments show our methods' effectiveness and some of their remaining issues in improving GPT-4's information extraction ability.
comment: This article has an original arxiv version entitled "Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors", whose url link is arXiv/2305.14450
♻ ☆ Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs
Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.
comment: 5 figures, 2 tables
♻ ☆ Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies
Recent advancements in natural language processing, particularly with large language models (LLMs) like GPT-4, have significantly enhanced dialogue systems, enabling them to generate more natural and fluent conversations. Despite these improvements, challenges persist, such as managing continuous dialogues, memory retention, and minimizing hallucinations. The AIWolfDial2024 addresses these challenges by employing the Werewolf Game, an incomplete information game, to test the capabilities of LLMs in complex interactive environments. This paper introduces a LLM-based Werewolf Game AI, where each role is supported by situation analysis to aid response generation. Additionally, for the werewolf role, various persuasion strategies, including logical appeal, credibility appeal, and emotional appeal, are employed to effectively persuade other players to align with its actions.
comment: Accepted to the AIWolfDial2024 workshop at INLG 2024
♻ ☆ SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives, than generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
♻ ☆ LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
♻ ☆ Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.
Computer Vision and Pattern Recognition
☆ HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts especially when scaling to 4K resolution and higher. We figure out that the problem is caused by that, a single prompt for the generation of multiple scales provides insufficient efficacy. In response, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance utilizes patch-wise descriptions from MLLMs to elaborately guide the regional structure and texture generation. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.
☆ UC-NeRF: Uncertainty-aware Conditional Neural Radiance Fields from Endoscopic Sparse Views
Visualizing surgical scenes is crucial for revealing internal anatomical structures during minimally invasive procedures. Novel View Synthesis is a vital technique that offers geometry and appearance reconstruction, enhancing understanding, planning, and decision-making in surgical scenes. Despite the impressive achievements of Neural Radiance Field (NeRF), its direct application to surgical scenes produces unsatisfying results due to two challenges: endoscopic sparse views and significant photometric inconsistencies. In this paper, we propose uncertainty-aware conditional NeRF for novel view synthesis to tackle the severe shape-radiance ambiguity from sparse surgical views. The core of UC-NeRF is to incorporate the multi-view uncertainty estimation to condition the neural radiance field for modeling the severe photometric inconsistencies adaptively. Specifically, our UC-NeRF first builds a consistency learner in the form of multi-view stereo network, to establish the geometric correspondence from sparse views and generate uncertainty estimation and feature priors. In neural rendering, we design a base-adaptive NeRF network to exploit the uncertainty estimation for explicitly handling the photometric inconsistencies. Furthermore, an uncertainty-guided geometry distillation is employed to enhance geometry learning. Experiments on the SCARED and Hamlyn datasets demonstrate our superior performance in rendering appearance and geometry, consistently outperforming the current state-of-the-art approaches. Our code will be released at \url{https://github.com/wrld/UC-NeRF}.
☆ Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving
Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making, without providing explicit guidance on traffic rules and driving skills, which are critical aspects directly related to driving safety. To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice. In particular, we conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provided extensive analysis. We also fine-tuned popular models, achieving notable performance improvements, which further validate the significance of our dataset. The project page can be found at: \url{https://4dvlab.github.io/project_page/idkb.html}
☆ SITAR: Semi-supervised Image Transformer for Action Recognition ICPR 2024
Recognizing actions from a limited set of labeled videos remains a challenge as annotating visual data is not only tedious but also can be expensive due to classified nature. Moreover, handling spatio-temporal data using deep $3$D transformers for this can introduce significant computational complexity. In this paper, our objective is to address video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos along with a collection of unlabeled videos in a compute efficient manner. Specifically, we rearrange multiple frames from the input videos in row-column form to construct super images. Subsequently, we capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images. Our proposed approach employs two pathways to generate representations for temporally augmented super images originating from the same video. Specifically, we utilize a 2D image-transformer to generate representations and apply a contrastive loss function to minimize the similarity between representations from different videos while maximizing the representations of identical videos. Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition across various benchmark datasets, all while significantly reducing computational costs.
comment: Accepted at ICPR 2024
☆ LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as \textit{degraded performance with more images} and \textit{high computational costs}. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model \textbf{LongLLaVA}~(\textbf{Long}-Context \textbf{L}arge \textbf{L}anguage \textbf{a}nd \textbf{V}ision \textbf{A}ssistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
comment: 19 pages, 7 figures, 6 tables
☆ CanvOI, an Oncology Intelligence Foundation Model: Scaling FLOPS Differently
The rapidly evolving field of digital oncopathology faces significant challenges, including the need to address diverse and complex clinical questions, often involving rare conditions, with limited availability of labeled data. These limitations hinder the development of robust AI-driven tools in the biomedical space, where accuracy in probabilistic determinations is of utmost importance. To address this, digital pathology foundation models have begun to emerge, typically developed with the size and diversity of the pre-training dataset and model parameters in mind. Here, we present CanvOI, a ViT-g/10-based foundation model designed to enhance the capabilities of digital pathology by addressing these challenges through a different approach. Considering the unique nature of oncologic histopathological images and the requirements from the embeddings to provide meaningful representations for Multiple Instance Learning (MIL) downstream models, we chose to modify the input image characteristics. By introducing larger tile sizes (380 x 380 pixels) and smaller patch sizes (10 x 10 pixels), we were able to optimize the model's performance, pushing computational resources in a new direction and achieving state-of-the-art performance on cancer-related benchmarks. CanvOI demonstrated a 1.5-7.4% improvement in averaged AUC compared to other leading foundation models built for digital pathology. Moreover, our results demonstrate that CanvOI significantly outperformed the other models, with the performance gap widening substantially when trained on just 10% of the initial cohort. This work highlights an alternative approach that, if integrated with traditional development approaches, has the potential to advance Oncology Intelligence (OI), overcome some of the current barriers and ultimately improve the clinical outcome of cancer patients.
comment: 12 pages, 5 figures
☆ Multi-stream deep learning framework to predict mild cognitive impairment with Rey Complex Figure Test
Drawing tests like the Rey Complex Figure Test (RCFT) are widely used to assess cognitive functions such as visuospatial skills and memory, making them valuable tools for detecting mild cognitive impairment (MCI). Despite their utility, existing predictive models based on these tests often suffer from limitations like small sample sizes and lack of external validation, which undermine their reliability. We developed a multi-stream deep learning framework that integrates two distinct processing streams: a multi-head self-attention based spatial stream using raw RCFT images and a scoring stream employing a previously developed automated scoring system. Our model was trained on data from 1,740 subjects in the Korean cohort and validated on an external hospital dataset of 222 subjects from Korea. The proposed multi-stream model demonstrated superior performance over baseline models (AUC = 0.872, Accuracy = 0.781) in external validation. The integration of both spatial and scoring streams enables the model to capture intricate visual details from the raw images while also incorporating structured scoring data, which together enhance its ability to detect subtle cognitive impairments. This dual approach not only improves predictive accuracy but also increases the robustness of the model, making it more reliable in diverse clinical settings. Our model has practical implications for clinical settings, where it could serve as a cost-effective tool for early MCI screening.
comment: 20 pages, 3 figures, 2 tables
☆ Benchmarking Spurious Bias in Few-Shot Image Classifiers ECCV 2024
Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in certain samples and few-shot classifiers can suffer from spurious bias induced from them. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at https://github.com/gtzheng/FewSTAB.
comment: Accepted to ECCV 2024
☆ The Impact of Balancing Real and Synthetic Data on Accuracy and Fairness in Face Recognition ECCV 2024
Over the recent years, the advancements in deep face recognition have fueled an increasing demand for large and diverse datasets. Nevertheless, the authentic data acquired to create those datasets is typically sourced from the web, which, in many cases, can lead to significant privacy issues due to the lack of explicit user consent. Furthermore, obtaining a demographically balanced, large dataset is even more difficult because of the natural imbalance in the distribution of images from different demographic groups. In this paper, we investigate the impact of demographically balanced authentic and synthetic data, both individually and in combination, on the accuracy and fairness of face recognition models. Initially, several generative methods were used to balance the demographic representations of the corresponding synthetic datasets. Then a state-of-the-art face encoder was trained and evaluated using (combinations of) synthetic and authentic images. Our findings emphasized two main points: (i) the increased effectiveness of training data generated by diffusion-based models in enhancing accuracy, whether used alone or combined with subsets of authentic data, and (ii) the minimal impact of incorporating balanced data from pre-trained generative methods on fairness (in nearly all tested scenarios using combined datasets, fairness scores remained either unchanged or worsened, even when compared to unbalanced authentic datasets). Source code and data are available at \url{https://cutt.ly/AeQy1K5G} for reproducibility.
comment: Accepted at Synthetic Data for Computer Vision Workshop - Side Event at ECCV 2024
☆ Hybrid-Segmentor: A Hybrid Approach to Automated Fine-Grained Crack Segmentation in Civil Infrastructure
Detecting and segmenting cracks in infrastructure, such as roads and buildings, is crucial for safety and cost-effective maintenance. In spite of the potential of deep learning, there are challenges in achieving precise results and handling diverse crack types. With the proposed dataset and model, we aim to enhance crack detection and infrastructure maintenance. We introduce Hybrid-Segmentor, an encoder-decoder based approach that is capable of extracting both fine-grained local and global crack features. This allows the model to improve its generalization capabilities in distinguish various type of shapes, surfaces and sizes of cracks. To keep the computational performances low for practical purposes, while maintaining the high the generalization capabilities of the model, we incorporate a self-attention model at the encoder level, while reducing the complexity of the decoder component. The proposed model outperforms existing benchmark models across 5 quantitative metrics (accuracy 0.971, precision 0.804, recall 0.744, F1-score 0.770, and IoU score 0.630), achieving state-of-the-art status.
comment: 25 pages, 6 figures
☆ Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models
Generating lifelike 3D humans from a single RGB image remains a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D generation, but they often face inconsistent view issues, which hinder high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D human from a single RGB image using Video Diffusion Models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into a human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns lifelike humans under the guidance of these high-resolution and view-consistent images. Experiments demonstrate that Human-VDM achieves high-quality 3D human from a single image, outperforming state-of-the-art methods in both generation quality and quantity. Project page: https://human-vdm.github.io/Human-VDM/
comment: 14 Pages, 8 figures, Project page: https://human-vdm.github.io/Human-VDM/
☆ MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
In stereo matching, CNNs have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM) in training Transformer-based stereo model. Given randomly masked stereo images as inputs, our method attempts to conduct both image reconstruction and depth prediction tasks. While this strategy is beneficial to resolving the data scarcity issue, the dual challenge of reconstructing masked tokens and subsequently performing stereo matching poses significant challenges, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher), updated via Exponential Moving Average (EMA), along with the original stereo model (student), where teacher predictions serve as pseudo supervisory signals to effectively distill knowledge into the student model. State-of-the-arts performance is achieved with the proposed method on several stereo matching such as ETH3D and KITTI 2015. Additionally, to demonstrate that our model effectively leverages locality inductive bias, we provide the attention distance measurement.
☆ iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation
Transfer learning based on full fine-tuning (FFT) of the pre-trained encoder and task-specific decoder becomes increasingly complex as deep models grow exponentially. Parameter efficient fine-tuning (PEFT) approaches using adapters consisting of small learnable layers have emerged as an alternative to FFT, achieving comparable performance while maintaining high training efficiency. However, the inflexibility of the adapter with respect to input instances limits its capability of learning task-specific information in diverse downstream tasks. In this paper, we propose a novel PEFT approach, input-Conditioned transFormer, termed iConFormer, that leverages a dynamic adapter conditioned on the input instances. To secure flexible learning ability on input instances in various downstream tasks, we introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation. To be specific, iCoN generates channel-wise convolutional kernels for each feature and transform it using adaptive convolution process to effectively capture task-specific and fine-grained details tailor to downstream tasks. Experimental results demonstrate that by tuning just 1.6% to 2.8% of the Transformer backbone parameters, iConFormer achieves performance comparable to FFT in monocular depth estimation and semantic segmentation, while outperforming it in image classification and instance segmentation. Also, the proposed method consistently outperforms recent PEFT methods for all the tasks mentioned above.
☆ ExpLLM: Towards Chain of Thought for Facial Expression Recognition
Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.
comment: project page: https://starhiking.github.io/ExpLLM_Page/
☆ Automatic facial axes standardization of 3D fetal ultrasound images
Craniofacial anomalies indicate early developmental disturbances and are usually linked to many genetic syndromes. Early diagnosis is critical, yet ultrasound (US) examinations often fail to identify these features. This study presents an AI-driven tool to assist clinicians in standardizing fetal facial axes/planes in 3D US, reducing sonographer workload and facilitating the facial evaluation. Our network, structured into three blocks-feature extractor, rotation and translation regression, and spatial transformer-processes three orthogonal 2D slices to estimate the necessary transformations for standardizing the facial planes in the 3D US. These transformations are applied to the original 3D US using a differentiable module (the spatial transformer block), yielding a standardized 3D US and the corresponding 2D facial standard planes. The dataset used consists of 1180 fetal facial 3D US images acquired between weeks 20 and 35 of gestation. Results show that our network considerably reduces inter-observer rotation variability in the test set, with a mean geodesic angle difference of 14.12$^{\circ}$ $\pm$ 18.27$^{\circ}$ and an Euclidean angle error of 7.45$^{\circ}$ $\pm$ 14.88$^{\circ}$. These findings demonstrate the network's ability to effectively standardize facial axes, crucial for consistent fetal facial assessments. In conclusion, the proposed network demonstrates potential for improving the consistency and accuracy of fetal facial assessments in clinical settings, facilitating early evaluation of craniofacial anomalies.
☆ Deep Learning Meets Satellite Images -- An Evaluation on Handcrafted and Learning-based Features for Multi-date Satellite Stereo Images ECCV2024
A critical step in the digital surface models(DSM) generation is feature matching. Off-track (or multi-date) satellite stereo images, in particular, can challenge the performance of feature matching due to spectral distortions between images, long baseline, and wide intersection angles. Feature matching methods have evolved over the years from handcrafted methods (e.g., SIFT) to learning-based methods (e.g., SuperPoint and SuperGlue). In this paper, we compare the performance of different features, also known as feature extraction and matching methods, applied to satellite imagery. A wide range of stereo pairs(~500) covering two separate study sites are used. SIFT, as a widely used classic feature extraction and matching algorithm, is compared with seven deep-learning matching methods: SuperGlue, LightGlue, LoFTR, ASpanFormer, DKM, GIM-LightGlue, and GIM-DKM. Results demonstrate that traditional matching methods are still competitive in this age of deep learning, although for particular scenarios learning-based methods are very promising.
comment: ECCV2024 Workshop - TradiCV
☆ MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
☆ UnLearning from Experience to Avoid Spurious Correlations
While deep neural networks can achieve state-of-the-art performance in many tasks, these models are more fragile than they appear. They are prone to learning spurious correlations in their training data, leading to surprising failure cases. In this paper, we propose a new approach that addresses the issue of spurious correlations: UnLearning from Experience (ULE). Our method is based on using two classification models trained in parallel: student and teacher models. Both models receive the same batches of training data. The student model is trained with no constraints and pursues the spurious correlations in the data. The teacher model is trained to solve the same classification problem while avoiding the mistakes of the student model. As training is done in parallel, the better the student model learns the spurious correlations, the more robust the teacher model becomes. The teacher model uses the gradient of the student's output with respect to its input to unlearn mistakes made by the student. We show that our method is effective on the Waterbirds, CelebA, Spawrious and UrbanCars datasets.
comment: 10 pages
☆ Validation of musculoskeletal segmentation model with uncertainty estimation for bone and muscle assessment in hip-to-knee clinical CT images
Deep learning-based image segmentation has allowed for the fully automated, accurate, and rapid analysis of musculoskeletal (MSK) structures from medical images. However, current approaches were either applied only to 2D cross-sectional images, addressed few structures, or were validated on small datasets, which limit the application in large-scale databases. This study aimed to validate an improved deep learning model for volumetric MSK segmentation of the hip and thigh with uncertainty estimation from clinical computed tomography (CT) images. Databases of CT images from multiple manufacturers/scanners, disease status, and patient positioning were used. The segmentation accuracy, and accuracy in estimating the structures volume and density, i.e., mean HU, were evaluated. An approach for segmentation failure detection based on predictive uncertainty was also investigated. The model has shown an overall improvement with respect to all segmentation accuracy and structure volume/density evaluation metrics. The predictive uncertainty yielded large areas under the receiver operating characteristic (AUROC) curves (AUROCs>=.95) in detecting inaccurate and failed segmentations. The high segmentation and muscle volume/density estimation accuracy, along with the high accuracy in failure detection based on the predictive uncertainty, exhibited the model's reliability for analyzing individual MSK structures in large-scale CT databases.
comment: 29 pages, 7+10supp figures, 8 tables
☆ CLDA: Collaborative Learning for Enhanced Unsupervised Domain Adaptation
Unsupervised Domain Adaptation (UDA) endeavors to bridge the gap between a model trained on a labeled source domain and its deployment in an unlabeled target domain. However, current high-performance models demand significant resources, resulting in prohibitive deployment costs and highlighting the need for small yet effective models. For UDA of lightweight models, Knowledge Distillation (KD) in a Teacher-Student framework can be a common approach, but we find that domain shift in UDA leads to a significant increase in non-salient parameters in the teacher model, degrading model's generalization ability and transferring misleading information to the student model. Interestingly, we observed that this phenomenon occurs considerably less in the student model. Driven by this insight, we introduce Collaborative Learning, a method that updates the teacher's non-salient parameters using the student model and at the same time enhance the student's performance using the updated teacher model. Experiments across various tasks and datasets show consistent performance improvements for both student and teacher models. For example, in semantic segmentation, CLDA achieves an improvement of +0.7% mIoU for teacher and +1.4% mIoU for student compared to the baseline model in the GTA to Cityscapes. In the Synthia to Cityscapes, it achieves an improvement of +0.8% mIoU for teacher and +2.0% mIoU for student.
☆ Rethinking HTG Evaluation: Bridging Generation and Recognition
The evaluation of generative models for natural image tasks has been extensively studied. Similar protocols and metrics are used in cases with unique particularities, such as Handwriting Generation, even if they might not be completely appropriate. In this work, we introduce three measures tailored for HTG evaluation, $ \text{HTG}_{\text{HTR}} $, $ \text{HTG}_{\text{style}} $, and $ \text{HTG}_{\text{OOV}} $, and argue that they are more expedient to evaluate the quality of generated handwritten images. The metrics rely on the recognition error/accuracy of Handwriting Text Recognition and Writer Identification models and emphasize writing style, textual content, and diversity as the main aspects that adhere to the content of handwritten images. We conduct comprehensive experiments on the IAM handwriting database, showcasing that widely used metrics such as FID fail to properly quantify the diversity and the practical utility of generated handwriting samples. Our findings show that our metrics are richer in information and underscore the necessity of standardized evaluation protocols in HTG. The proposed metrics provide a more robust and informative protocol for assessing HTG quality, contributing to improved performance in HTR. Code for the evaluation protocol is available at: https://github.com/koninik/HTG_evaluation.
☆ Improved Single Camera BEV Perception Using Multi-Camera Training SC 2024
Bird's Eye View (BEV) map prediction is essential for downstream autonomous driving tasks like trajectory prediction. In the past, this was accomplished through the use of a sophisticated sensor configuration that captured a surround view from multiple cameras. However, in large-scale production, cost efficiency is an optimization goal, so that using fewer cameras becomes more relevant. But the consequence of fewer input images correlates with a performance drop. This raises the problem of developing a BEV perception model that provides a sufficient performance on a low-cost sensor setup. Although, primarily relevant for inference time on production cars, this cost restriction is less problematic on a test vehicle during training. Therefore, the objective of our approach is to reduce the aforementioned performance drop as much as possible using a modern multi-camera surround view model reduced for single-camera inference. The approach includes three features, a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss for supervising the transition from six-camera inputs to one-camera input during training. Our method outperforms versions trained strictly with one camera or strictly with six-camera surround view for single-camera inference resulting in reduced hallucination and better quality of the BEV map.
comment: This Paper has been accepted to the 27th IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)
☆ Multi-Head Attention Residual Unfolded Network for Model-Based Pansharpening
The objective of pansharpening and hypersharpening is to accurately combine a high-resolution panchromatic (PAN) image with a low-resolution multispectral (MS) or hyperspectral (HS) image, respectively. Unfolding fusion methods integrate the powerful representation capabilities of deep learning with the robustness of model-based approaches. These techniques involve unrolling the steps of the optimization scheme derived from the minimization of an energy into a deep learning framework, resulting in efficient and highly interpretable architectures. In this paper, we propose a model-based deep unfolded method for satellite image fusion. Our approach is based on a variational formulation that incorporates the classic observation model for MS/HS data, a high-frequency injection constraint based on the PAN image, and an arbitrary convex prior. For the unfolding stage, we introduce upsampling and downsampling layers that use geometric information encoded in the PAN image through residual networks. The backbone of our method is a multi-head attention residual network (MARNet), which replaces the proximity operator in the optimization scheme and combines multiple head attentions with residual learning to exploit image self-similarities via nonlocal operators defined in terms of patches. Additionally, we incorporate a post-processing module based on the MARNet architecture to further enhance the quality of the fused images. Experimental results on PRISMA, Quickbird, and WorldView2 datasets demonstrate the superior performance of our method and its ability to generalize across different sensor configurations and varying spatial and spectral resolutions. The source code will be available at https://github.com/TAMI-UIB/MARNet.
☆ Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection
The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model (e.g., CLIP) solely based on manipulating its input without tuning the inner parameters. Furthermore, we insert a pseudo-word guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performances of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) our superior performances are at less cost of trainable parameters, making it a promising approach for real-world applications.
☆ PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation
While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose \textbf{PoseTalk}, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latent from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4\% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low-to-high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate our pose prediction strategy achieves better pose diversity and realness compared to text-only or audio-only, and our video generator model outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.
comment: 7+5 pages, 15 figures
☆ Skip-and-Play: Depth-Driven Pose-Preserved Image Generation for Any Objects
The emergence of diffusion models has enabled the generation of diverse high-quality images solely from text, prompting subsequent efforts to enhance the controllability of these models. Despite the improvement in controllability, pose control remains limited to specific objects (e.g., humans) or poses (e.g., frontal view) due to the fact that pose is generally controlled via camera parameters (e.g., rotation angle) or keypoints (e.g., eyes, nose). Specifically, camera parameters-conditional pose control models generate unrealistic images depending on the object, owing to the small size of 3D datasets for training. Also, keypoint-based approaches encounter challenges in acquiring reliable keypoints for various objects (e.g., church) or poses (e.g., back view). To address these limitations, we propose depth-based pose control, as depth maps are easily obtainable from a single depth estimation model regardless of objects and poses, unlike camera parameters and keypoints. However, depth-based pose control confronts issues of shape dependency, as depth maps influence not only the pose but also the shape of the generated images. To tackle this issue, we propose Skip-and-Play (SnP), designed via analysis of the impact of three components of depth-conditional ControlNet on the pose and the shape of the generated images. To be specific, based on the analysis, we selectively skip parts of the components to mitigate shape dependency on the depth map while preserving the pose. Through various experiments, we demonstrate the superiority of SnP over baselines and showcase the ability of SnP to generate images of diverse objects and poses. Remarkably, SnP exhibits the ability to generate images even when the objects in the condition (e.g., a horse) and the prompt (e.g., a hedgehog) differ from each other.
☆ Creating a Microstructure Latent Space with Rich Material Information for Multiphase Alloy Design
The intricate microstructure serves as the cornerstone for the composition/processing-structure-property (CPSP) connection in multiphase alloys. Traditional alloy design methods often overlook microstructural details, which diminishes the reliability and effectiveness of the outcomes. This study introduces an improved alloy design algorithm that integrates authentic microstructural information to establish precise CPSP relationships. The approach utilizes a deep-learning framework based on a variational autoencoder to map real microstructural data to a latent space, enabling the prediction of composition, processing steps, and material properties from the latent space vector. By integrating this deep learning model with a specific sampling strategy in the latent space, a novel, microstructure-centered algorithm for multiphase alloy design is developed. This algorithm is demonstrated through the design of a unified dual-phase steel, and the results are assessed at three performance levels. Moreover, an exploration into the latent vector space of the model highlights its seamless interpolation ability and its rich material information content. Notably, the current configuration of the latent space is particularly advantageous for alloy design, offering an exhaustive representation of microstructure, composition, processing, and property variations essential for multiphase alloys.
☆ Learning-Based Error Detection System for Advanced Vehicle Instrument Cluster Rendering
The automotive industry is currently expanding digital display options with every new model that comes onto the market. This entails not just an expansion in dimensions, resolution, and customization choices, but also the capability to employ novel display effects like overlays while assembling the content of the display cluster. Unfortunately, this raises the need for appropriate monitoring systems that can detect rendering errors and apply appropriate countermeasures when required. Classical solutions such as Cyclic Redundancy Checks (CRC) will soon be no longer viable as any sort of alpha blending, warping of scaling of content can cause unwanted CRC violations. Therefore, we propose a novel monitoring approach to verify correctness of displayed content using telltales (e.g. warning signs) as example. It uses a learning-based approach to separate "good" telltales, i.e. those that a human driver will understand correctly, and "corrupted" telltales, i.e. those that will not be visible or perceived correctly. As a result, it possesses inherent resilience against individual pixel errors and implicitly supports changing backgrounds, overlay or scaling effects. This is underlined by our experimental study where all "corrupted" test patterns were correctly classified, while no false alarms were triggered.
comment: 9 pages
☆ MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos
Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer's egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets with the existing and our proposed new evaluation metrics demonstrate that MADiff predicts comparably reasonable hand trajectories compared to the state-of-the-art baselines, and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: https://irmvlab.github.io/madiff.github.io.
☆ Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and improving audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.
☆ AdvSecureNet: A Python Toolkit for Adversarial Machine Learning
Machine learning models are vulnerable to adversarial attacks. Several tools have been developed to research these vulnerabilities, but they often lack comprehensive features and flexibility. We introduce AdvSecureNet, a PyTorch based toolkit for adversarial machine learning that is the first to natively support multi-GPU setups for attacks, defenses, and evaluation. It is the first toolkit that supports both CLI and API interfaces and external YAML configuration files to enhance versatility and reproducibility. The toolkit includes multiple attacks, defenses and evaluation metrics. Rigiorous software engineering practices are followed to ensure high code quality and maintainability. The project is available as an open-source project on GitHub at https://github.com/melihcatal/advsecurenet and installable via PyPI.
☆ GoT-CQA: Graph-of-Thought Guided Compositional Reasoning for Chart Question Answering
Chart Question Answering (CQA) aims at answering questions based on the visual chart content, which plays an important role in chart sumarization, business data analysis, and data report generation. CQA is a challenging multi-modal task because of the strong context dependence and complex reasoning requirement. The former refers to answering this question strictly based on the analysis of the visual content or internal data of the given chart, while the latter emphasizes the various logical and numerical reasoning involved in answer prediction process. In this paper, we pay more attention on the complex reasoning in CQA task, and propose a novel Graph-of-Thought (GoT) guided compositional reasoning model called GoT-CQA to overcome this problem. At first, we transform the chart-oriented question into a directed acyclic GoT composed of multiple operator nodes, including localization, numerical and logical operator. It intuitively reflects the human brain's solution process to this question. After that, we design an efficient auto-compositional reasoning framework guided by the GoT, to excute the multi-step reasoning operations in various types of questions. Comprehensive experiments on ChartQA and PlotQA-D datasets show that GoT-CQA achieves outstanding performance, especially in complex human-written and reasoning questions, comparing with the latest popular baselines.
☆ A Medical Multimodal Large Language Model for Pediatric Pneumonia
Pediatric pneumonia is the leading cause of death among children under five years worldwide, imposing a substantial burden on affected families. Currently, there are three significant hurdles in diagnosing and treating pediatric pneumonia. Firstly, pediatric pneumonia shares similar symptoms with other respiratory diseases, making rapid and accurate differential diagnosis challenging. Secondly, primary hospitals often lack sufficient medical resources and experienced doctors. Lastly, providing personalized diagnostic reports and treatment recommendations is labor-intensive and time-consuming. To tackle these challenges, we proposed a Medical Multimodal Large Language Model for Pediatric Pneumonia (P2Med-MLLM). It was capable of handling diverse clinical tasks, such as generating free-text radiology reports and medical records within a unified framework. Specifically, P2Med-MLLM can process both pure text and image-text data, trained on an extensive and large-scale dataset (P2Med-MD), including real clinical information from 163,999 outpatient and 8,684 inpatient cases. This dataset comprised 2D chest X-ray images, 3D chest CT images, corresponding radiology reports, and outpatient and inpatient records. We designed a three-stage training strategy to enable P2Med-MLLM to comprehend medical knowledge and follow instructions for various clinical tasks. To rigorously evaluate P2Med-MLLM's performance, we developed P2Med-MBench, a benchmark consisting of 642 meticulously verified samples by pediatric pulmonology specialists, covering six clinical decision-support tasks and a balanced variety of diseases. The automated scoring results demonstrated the superiority of P2Med-MLLM. This work plays a crucial role in assisting primary care doctors with prompt disease diagnosis and treatment planning, reducing severe symptom mortality rates, and optimizing the allocation of medical resources.
comment: 18 pages, 10 figures
☆ A Fashion Item Recommendation Model in Hyperbolic Space CVPR 2024
In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users' purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
comment: This work was presented at the CVFAD Workshop at CVPR 2024
☆ SurgTrack: CAD-Free 3D Tracking of Real-world Surgical Instruments
Vision-based surgical navigation has received increasing attention due to its non-invasive, cost-effective, and flexible advantages. In particular, a critical element of the vision-based navigation system is tracking surgical instruments. Compared with 2D instrument tracking methods, 3D instrument tracking has broader value in clinical practice, but is also more challenging due to weak texture, occlusion, and lack of Computer-Aided Design (CAD) models for 3D registration. To solve these challenges, we propose the SurgTrack, a two-stage 3D instrument tracking method for CAD-free and robust real-world applications. In the first registration stage, we incorporate an Instrument Signed Distance Field (SDF) modeling the 3D representation of instruments, achieving CAD-freed 3D registration. Due to this, we can obtain the location and orientation of instruments in the 3D space by matching the video stream with the registered SDF model. In the second tracking stage, we devise a posture graph optimization module, leveraging the historical tracking results of the posture memory pool to optimize the tracking results and improve the occlusion robustness. Furthermore, we collect the Instrument3D dataset to comprehensively evaluate the 3D tracking of surgical instruments. The extensive experiments validate the superiority and scalability of our SurgTrack, by outperforming the state-of-the-arts with a remarkable improvement. The code and dataset are available at https://github.com/wenwucode/SurgTrack.
☆ BMI Prediction from Handwritten English Characters Using a Convolutional Neural Network
A person's Body Mass Index, or BMI, is the most widely used parameter for assessing their health. BMI is a crucial predictor of potential diseases that may arise at higher body fat levels because it is correlated with body fat. Conversely, a community's or an individual's nutritional status can be determined using the BMI. Although deep learning models are used in several studies to estimate BMI from face photos and other data, no previous research established a clear connection between deep learning techniques for handwriting analysis and BMI prediction. This article addresses this research gap with a deep learning approach to estimating BMI from handwritten characters by developing a convolutional neural network (CNN). A dataset containing samples from 48 people in lowercase English scripts is successfully captured for the BMI prediction task. The proposed CNN-based approach reports a commendable accuracy of 99.92%. Performance comparison with other popular CNN architectures reveals that AlexNet and InceptionV3 achieve the second and third-best performance, with the accuracy of 99.69% and 99.53%, respectively.
☆ Object Gaussian for Monocular 6D Pose Estimation from Sparse Views
Monocular object pose estimation, as a pivotal task in computer vision and robotics, heavily depends on accurate 2D-3D correspondences, which often demand costly CAD models that may not be readily available. Object 3D reconstruction methods offer an alternative, among which recent advancements in 3D Gaussian Splatting (3DGS) afford a compelling potential. Yet its performance still suffers and tends to overfit with fewer input views. Embracing this challenge, we introduce SGPose, a novel framework for sparse view object pose estimation using Gaussian-based methods. Given as few as ten views, SGPose generates a geometric-aware representation by starting with a random cuboid initialization, eschewing reliance on Structure-from-Motion (SfM) pipeline-derived geometry as required by traditional 3DGS methods. SGPose removes the dependence on CAD models by regressing dense 2D-3D correspondences between images and the reconstructed model from sparse input and random initialization, while the geometric-consistent depth supervision and online synthetic view warping are key to the success. Experiments on typical benchmarks, especially on the Occlusion LM-O dataset, demonstrate that SGPose outperforms existing methods even under sparse view constraints, under-scoring its potential in real-world applications.
☆ Solving Video Inverse Problems Using Image Diffusion Models
Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page: https://solving-video-inverse.github.io/main/
comment: 22 pages, 16 figures
☆ Evaluation Study on SAM 2 for Class-agnostic Instance-level Segmentation
Segment Anything Model (SAM) has demonstrated powerful zero-shot segmentation performance in natural scenes. The recently released Segment Anything Model 2 (SAM2) has further heightened researchers' expectations towards image segmentation capabilities. To evaluate the performance of SAM2 on class-agnostic instance-level segmentation tasks, we adopt different prompt strategies for SAM2 to cope with instance-level tasks for three relevant scenarios: Salient Instance Segmentation (SIS), Camouflaged Instance Segmentation (CIS), and Shadow Instance Detection (SID). In addition, to further explore the effectiveness of SAM2 in segmenting granular object structures, we also conduct detailed tests on the high-resolution Dichotomous Image Segmentation (DIS) benchmark to assess the fine-grained segmentation capability. Qualitative and quantitative experimental results indicate that the performance of SAM2 varies significantly across different scenarios. Besides, SAM2 is not particularly sensitive to segmenting high-resolution fine details. We hope this technique report can drive the emergence of SAM2-based adapters, aiming to enhance the performance ceiling of large vision models on class-agnostic instance segmentation tasks.
☆ How Do You Perceive My Face? Recognizing Facial Expressions in Multi-Modal Context by Modeling Mental Representations
Facial expression perception in humans inherently relies on prior knowledge and contextual cues, contributing to efficient and flexible processing. For instance, multi-modal emotional context (such as voice color, affective text, body pose, etc.) can prompt people to perceive emotional expressions in objectively neutral faces. Drawing inspiration from this, we introduce a novel approach for facial expression classification that goes beyond simple classification tasks. Our model accurately classifies a perceived face and synthesizes the corresponding mental representation perceived by a human when observing a face in context. With this, our model offers visual insights into its internal decision-making process. We achieve this by learning two independent representations of content and context using a VAE-GAN architecture. Subsequently, we propose a novel attention mechanism for context-dependent feature adaptation. The adapted representation is used for classification and to generate a context-augmented expression. We evaluate synthesized expressions in a human study, showing that our model effectively produces approximations of human mental representations. We achieve State-of-the-Art classification accuracies of 81.01% on the RAVDESS dataset and 79.34% on the MEAD dataset. We make our code publicly available.
comment: GCPR 2024
☆ Interacting Multiple Model-based Joint Homography Matrix and Multiple Object State Estimation
A novel MOT algorithm, IMM Joint Homography State Estimation (IMM-JHSE), is proposed. By jointly modelling the camera projection matrix as part of track state vectors, IMM-JHSE removes the explicit influence of camera motion compensation techniques on predicted track position states, which was prevalent in previous approaches. Expanding upon this, static and dynamic camera motion models are combined through the use of an IMM filter. A simple bounding box motion model is used to predict bounding box positions to incorporate image plane information. In addition to applying an IMM to camera motion, a non-standard IMM approach is applied where bounding-box-based BIoU scores are mixed with ground-plane-based Mahalanobis distances in an IMM-like fashion to perform association only. Finally, IMM-JHSE makes use of dynamic process and measurement noise estimation techniques. IMM-JHSE improves upon related techniques on the DanceTrack and KITTI-car datasets, increasing HOTA by 2.64 and 2.11, respectively, while offering competitive performance on the MOT17, MOT20 and KITTI-pedestrian datasets.
comment: Preprint submitted to Information Fusion
☆ Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation
Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
comment: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
☆ Real-Time Dynamic Scale-Aware Fusion Detection Network: Take Road Damage Detection as an example
Unmanned Aerial Vehicle (UAV)-based Road Damage Detection (RDD) is important for daily maintenance and safety in cities, especially in terms of significantly reducing labor costs. However, current UAV-based RDD research is still faces many challenges. For example, the damage with irregular size and direction, the masking of damage by the background, and the difficulty of distinguishing damage from the background significantly affect the ability of UAV to detect road damage in daily inspection. To solve these problems and improve the performance of UAV in real-time road damage detection, we design and propose three corresponding modules: a feature extraction module that flexibly adapts to shape and background; a module that fuses multiscale perception and adapts to shape and background ; an efficient downsampling module. Based on these modules, we designed a multi-scale, adaptive road damage detection model with the ability to automatically remove background interference, called Dynamic Scale-Aware Fusion Detection Model (RT-DSAFDet). Experimental results on the UAV-PDD2023 public dataset show that our model RT-DSAFDet achieves a mAP50 of 54.2%, which is 11.1% higher than that of YOLOv10-m, an efficient variant of the latest real-time object detection model YOLOv10, while the amount of parameters is reduced to 1.8M and FLOPs to 4.6G, with a decreased by 88% and 93%, respectively. Furthermore, on the large generalized object detection public dataset MS COCO2017 also shows the superiority of our model with mAP50-95 is the same as YOLOv9-t, but with 0.5% higher mAP50, 10% less parameters volume, and 40% less FLOPs.
☆ UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching
Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based approaches. This is mainly due to the limited availability of real-world ground truth for stereo matching, which is a limiting factor in improving the performance of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying self-supervised learning used for pre-training with stereo matching framework based on supervised learning. To be specific, we explore the effectiveness of reconstructing features of masked portions in an input image and at the same time predicting corresponding points in another image from the perspective of locality inductive bias, which is crucial in training models with limited training data. Moreover, to address these challenging tasks of reconstruction-and-prediction, we present a new strategy to vary a masking ratio when training the stereo model with stereo-tailored losses. State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as ETH3D, KITTI 2012, and KITTI 2015 datasets. Lastly, to investigate the advantages of the proposed approach, we provide a frequency analysis of feature maps and the analysis of locality inductive bias based on attention maps.
☆ StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models ECCV2024
Despite the burst of innovative methods for controlling the diffusion process, effectively controlling image styles in text-to-image generation remains a challenging task. Many adapter-based methods impose image representation conditions on the denoising process to accomplish image control. However these conditions are not aligned with the word embedding space, leading to interference between image and text control conditions and the potential loss of semantic information from the text prompt. Addressing this issue involves two key challenges. Firstly, how to inject the style representation without compromising the effectiveness of text representation in control. Secondly, how to obtain the accurate style representation from a single reference image. To tackle these challenges, we introduce StyleTokenizer, a zero-shot style control image generation method that aligns style representation with text representation using a style tokenizer. This alignment effectively minimizes the impact on the effectiveness of text prompts. Furthermore, we collect a well-labeled style dataset named Style30k to train a style feature extractor capable of accurately representing style while excluding other content information. Experimental results demonstrate that our method fully grasps the style characteristics of the reference image, generating appealing images that are consistent with both the target image style and text prompt. The code and dataset are available at https://github.com/alipay/style-tokenizer.
comment: Accepted by ECCV2024
☆ Sample what you cant compress
For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled interpretation. Concurrently, in generative settings diffusion has demonstrated a remarkable ability to create crisp, high quality results and has solid theoretical underpinnings (from variational inference to direct study as the Fisher Divergence). Our work combines autoencoder representation learning with diffusion and is, to our knowledge, the first to demonstrate the efficacy of jointly learning a continuous encoder and decoder under a diffusion-based loss. We demonstrate that this approach yields better reconstruction quality as compared to GAN-based autoencoders while being easier to tune. We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss. Since our decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation; we therefore name our approach "Sample what you can't compress", or SWYCC for short.
☆ SG-MIM: Structured Knowledge Guided Efficient Pre-training for Dense Prediction
Masked Image Modeling (MIM) techniques have redefined the landscape of computer vision, enabling pre-trained models to achieve exceptional performance across a broad spectrum of tasks. Despite their success, the full potential of MIM-based methods in dense prediction tasks, particularly in depth estimation, remains untapped. Existing MIM approaches primarily rely on single-image inputs, which makes it challenging to capture the crucial structured information, leading to suboptimal performance in tasks requiring fine-grained feature representation. To address these limitations, we propose SG-MIM, a novel Structured knowledge Guided Masked Image Modeling framework designed to enhance dense prediction tasks by utilizing structured knowledge alongside images. SG-MIM employs a lightweight relational guidance framework, allowing it to guide structured knowledge individually at the feature level rather than naively combining at the pixel level within the same architecture, as is common in traditional multi-modal pre-training methods. This approach enables the model to efficiently capture essential information while minimizing discrepancies between pre-training and downstream tasks. Furthermore, SG-MIM employs a selective masking strategy to incorporate structured knowledge, maximizing the synergy between general representation learning and structured knowledge-specific learning. Our method requires no additional annotations, making it a versatile and efficient solution for a wide range of applications. Our evaluations on the KITTI, NYU-v2, and ADE20k datasets demonstrate SG-MIM's superiority in monocular depth estimation and semantic segmentation.
☆ TLD: A Vehicle Tail Light signal Dataset and Benchmark
Understanding other drivers' intentions is crucial for safe driving. The role of taillights in conveying these intentions is underemphasized in current autonomous driving systems. Accurately identifying taillight signals is essential for predicting vehicle behavior and preventing collisions. Open-source taillight datasets are scarce, often small and inconsistently annotated. To address this gap, we introduce a new large-scale taillight dataset called TLD. Sourced globally, our dataset covers diverse traffic scenarios. To our knowledge, TLD is the first dataset to separately annotate brake lights and turn signals in real driving scenarios. We collected 17.78 hours of driving videos from the internet. This dataset consists of 152k labeled image frames sampled at a rate of 2 Hz, along with 1.5 million unlabeled frames interspersed throughout. Additionally, we have developed a two-stage vehicle light detection model consisting of two primary modules: a vehicle detector and a taillight classifier. Initially, YOLOv10 and DeepSORT captured consecutive vehicle images over time. Subsequently, the two classifiers work simultaneously to determine the states of the brake lights and turn signals. A post-processing procedure is then used to eliminate noise caused by misidentifications and provide the taillight states of the vehicle within a given time frame. Our method shows exceptional performance on our dataset, establishing a benchmark for vehicle taillight detection. The dataset is available at https://huggingface.co/datasets/ChaiJohn/TLD/tree/main
☆ A Learnable Color Correction Matrix for RAW Reconstruction BMVC2024
Autonomous driving algorithms usually employ sRGB images as model input due to their compatibility with the human visual system. However, visually pleasing sRGB images are possibly sub-optimal for downstream tasks when compared to RAW images. The availability of RAW images is constrained by the difficulties in collecting real-world driving data and the associated challenges of annotation. To address this limitation and support research in RAW-domain driving perception, we design a novel and ultra-lightweight RAW reconstruction method. The proposed model introduces a learnable color correction matrix (CCM), which uses only a single convolutional layer to approximate the complex inverse image signal processor (ISP). Experimental results demonstrate that simulated RAW (simRAW) images generated by our method provide performance improvements equivalent to those produced by more complex inverse ISP methods when pretraining RAW-domain object detectors, which highlights the effectiveness and practicality of our approach.
comment: Accepted by BMVC2024
☆ Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth Estimation
Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have shown impressive depth estimation results through carefully designed network structures, but they usually ignore the planar information and therefore perform poorly in low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively utilizes plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model planes in the scene and predict per-pixel plane coefficients. Then the predicted plane coefficients can be converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APGA) module, we introduce a novel feature interaction approach to improve the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method can achieve outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves competitive results with state-of-the-art methods KITTI dataset and can be generalized to unseen scenes effectively.
comment: 14 pages, 12 figures, 8 tables
☆ Reliable Deep Diffusion Tensor Estimation: Rethinking the Power of Data-Driven Optimization Routine
Diffusion tensor imaging (DTI) holds significant importance in clinical diagnosis and neuroscience research. However, conventional model-based fitting methods often suffer from sensitivity to noise, leading to decreased accuracy in estimating DTI parameters. While traditional data-driven deep learning methods have shown potential in terms of accuracy and efficiency, their limited generalization to out-of-training-distribution data impedes their broader application due to the diverse scan protocols used across centers, scanners, and studies. This work aims to tackle these challenges and promote the use of DTI by introducing a data-driven optimization-based method termed DoDTI. DoDTI combines the weighted linear least squares fitting algorithm and regularization by denoising technique. The former fits DW images from diverse acquisition settings into diffusion tensor field, while the latter applies a deep learning-based denoiser to regularize the diffusion tensor field instead of the DW images, which is free from the limitation of fixed-channel assignment of the network. The optimization object is solved using the alternating direction method of multipliers and then unrolled to construct a deep neural network, leveraging a data-driven strategy to learn network parameters. Extensive validation experiments are conducted utilizing both internally simulated datasets and externally obtained in-vivo datasets. The results, encompassing both qualitative and quantitative analyses, showcase that the proposed method attains state-of-the-art performance in DTI parameter estimation. Notably, it demonstrates superior generalization, accuracy, and efficiency, rendering it highly reliable for widespread application in the field.
☆ TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT
While Multi-Object Tracking (MOT) has made substantial advancements, it is limited by heavy reliance on prior knowledge and limited to predefined categories. In contrast, Generic Multiple Object Tracking (GMOT), tracking multiple objects with similar appearance, requires less prior information about the targets but faces challenges with variants like viewpoint, lighting, occlusion, and resolution. Our contributions commence with the introduction of the \textbf{\text{Refer-GMOT dataset}} a collection of videos, each accompanied by fine-grained textual descriptions of their attributes. Subsequently, we introduce a novel text prompt-based open-vocabulary GMOT framework, called \textbf{\text{TP-GMOT}}, which can track never-seen object categories with zero training examples. Within \text{TP-GMOT} framework, we introduce two novel components: (i) {\textbf{\text{TP-OD}}, an object detection by a textual prompt}, for accurately detecting unseen objects with specific characteristics. (ii) Motion-Appearance Cost SORT \textbf{\text{MAC-SORT}}, a novel object association approach that adeptly integrates motion and appearance-based matching strategies to tackle the complex task of tracking multiple generic objects with high similarity. Our contributions are benchmarked on the \text{Refer-GMOT} dataset for GMOT task. Additionally, to assess the generalizability of the proposed \text{TP-GMOT} framework and the effectiveness of \text{MAC-SORT} tracker, we conduct ablation studies on the DanceTrack and MOT20 datasets for the MOT task. Our dataset, code, and models will be publicly available at: https://fsoft-aic.github.io/TP-GMOT
☆ Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization IROS 2024
Indoor robots rely on depth to perform tasks like navigation or obstacle detection, and single-image depth estimation is widely used to assist perception. Most indoor single-image depth prediction focuses less on model generalizability to unseen datasets, concerned with in-the-wild robustness for system deployment. This work leverages gradient-based meta-learning to gain higher generalizability on zero-shot cross-dataset inference. Unlike the most-studied meta-learning of image classification associated with explicit class labels, no explicit task boundaries exist for continuous depth values tied to highly varying indoor environments regarding object arrangement and scene composition. We propose fine-grained task that treats each RGB-D mini-batch as a task in our meta-learning formulation. We first show that our method on limited data induces a much better prior (max 27.8% in RMSE). Then, finetuning on meta-learned initialization consistently outperforms baselines without the meta approach. Aiming at generalization, we propose zero-shot cross-dataset protocols and validate higher generalizability induced by our meta-initialization, as a simple and useful plugin to many existing depth estimation methods. The work at the intersection of depth and meta-learning potentially drives both research to step closer to practical robotic and machine perception usage.
comment: IROS 2024. The version supersedes 2305.07269. arXiv admin note: text overlap with arXiv:2305.07269
☆ TASAR: Transferable Attack on Skeletal Action Recognition
Skeletal sequences, as well-structured representations of human behaviors, are crucial in Human Activity Recognition (HAR). The transferability of adversarial skeletal sequences enables attacks in real-world HAR scenarios, such as autonomous driving, intelligent surveillance, and human-computer interactions. However, existing Skeleton-based HAR (S-HAR) attacks exhibit weak adversarial transferability and, therefore, cannot be considered true transfer-based S-HAR attacks. More importantly, the reason for this failure remains unclear. In this paper, we study this phenomenon through the lens of loss surface, and find that its sharpness contributes to the poor transferability in S-HAR. Inspired by this observation, we assume and empirically validate that smoothening the rugged loss landscape could potentially improve adversarial transferability in S-HAR. To this end, we propose the first Transfer-based Attack on Skeletal Action Recognition, TASAR. TASAR explores the smoothed model posterior without re-training the pre-trained surrogates, which is achieved by a new post-train Dual Bayesian optimization strategy. Furthermore, unlike previous transfer-based attacks that treat each frame independently and overlook temporal coherence within sequences, TASAR incorporates motion dynamics into the Bayesian attack gradient, effectively disrupting the spatial-temporal coherence of S-HARs. To exhaustively evaluate the effectiveness of existing methods and our method, we build the first large-scale robust S-HAR benchmark, comprising 7 S-HAR models, 10 attack methods, 3 S-HAR datasets and 2 defense models. Extensive results demonstrate the superiority of TASAR. Our benchmark enables easy comparisons for future studies, with the code available in the supplementary material.
comment: arXiv admin note: text overlap with arXiv:2407.08572
☆ Volumetric Surfaces: Representing Fuzzy Geometries with Multiple Meshes
High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). These problems are exacerbated on low-performance graphics hardware, e.g. on mobile devices. We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed layer order from outermost to innermost. We model mesh layers as SDF shells with optimal spacing learned during training. After baking, we fit UV textures to the corresponding meshes. We show that our method can represent challenging fuzzy objects while achieving higher frame rates than volume-based and splatting-based methods on low-end and mobile devices.
☆ FrameCorr: Adaptive, Autoencoder-based Neural Compression for Video Reconstruction in Resource and Timing Constrained Network Settings
Despite the growing adoption of video processing via Internet of Things (IoT) devices due to their cost-effectiveness, transmitting captured data to nearby servers poses challenges due to varying timing constraints and scarcity of network bandwidth. Existing video compression methods face difficulties in recovering compressed data when incomplete data is provided. Here, we introduce \emph{\project}, a deep-learning based solution that utilizes previously received data to predict the missing segments of a frame, enabling the reconstruction of a frame from partially received data.
☆ Detecting Korean Food Using Image using Hierarchical Model
A solution was made available for Korean Food lovers who have dietary restrictions to identify the Korean food before consuming. Just by uploading a clear photo of the dish, people can get to know what they are eating. Image processing techniques together with machine learning helped to come up with this solution.
☆ Non-target Divergence Hypothesis: Toward Understanding Domain Gaps in Cross-Modal Knowledge Distillation
Compared to single-modal knowledge distillation, cross-modal knowledge distillation faces more severe challenges due to domain gaps between modalities. Although various methods have proposed various solutions to overcome these challenges, there is still limited research on how domain gaps affect cross-modal knowledge distillation. This paper provides an in-depth analysis and evaluation of this issue. We first introduce the Non-Target Divergence Hypothesis (NTDH) to reveal the impact of domain gaps on cross-modal knowledge distillation. Our key finding is that domain gaps between modalities lead to distribution differences in non-target classes, and the smaller these differences, the better the performance of cross-modal knowledge distillation. Subsequently, based on Vapnik-Chervonenkis (VC) theory, we derive the upper and lower bounds of the approximation error for cross-modal knowledge distillation, thereby theoretically validating the NTDH. Finally, experiments on five cross-modal datasets further confirm the validity, generalisability, and applicability of the NTDH.
☆ Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis
We consider the problem of independently, in a disentangled fashion, controlling the outputs of text-to-image diffusion models with color and style attributes of a user-supplied reference image. We present the first training-free, test-time-only method to disentangle and condition text-to-image models on color and style attributes from reference image. To realize this, we propose two key innovations. Our first contribution is to transform the latent codes at inference time using feature transformations that make the covariance matrix of current generation follow that of the reference image, helping meaningfully transfer color. Next, we observe that there exists a natural disentanglement between color and style in the LAB image space, which we exploit to transform the self-attention feature maps of the image being generated with respect to those of the reference computed from its L channel. Both these operations happen purely at test time and can be done independently or merged. This results in a flexible method where color and style information can come from the same reference image or two different sources, and a new generation can seamlessly fuse them in either scenario.
comment: 16 pages, 17 figures
☆ Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering
Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.
comment: 39 pages, 9 figures
☆ MOSMOS: Multi-organ segmentation facilitated by medical report supervision
Owing to a large amount of multi-modal data in modern medical systems, such as medical images and reports, Medical Vision-Language Pre-training (Med-VLP) has demonstrated incredible achievements in coarse-grained downstream tasks (i.e., medical classification, retrieval, and visual question answering). However, the problem of transferring knowledge learned from Med-VLP to fine-grained multi-organ segmentation tasks has barely been investigated. Multi-organ segmentation is challenging mainly due to the lack of large-scale fully annotated datasets and the wide variation in the shape and size of the same organ between individuals with different diseases. In this paper, we propose a novel pre-training & fine-tuning framework for Multi-Organ Segmentation by harnessing Medical repOrt Supervision (MOSMOS). Specifically, we first introduce global contrastive learning to maximally align the medical image-report pairs in the pre-training stage. To remedy the granularity discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags. More importantly, our pre-trained models can be transferred to any segmentation model by introducing the pixel-tag attention maps. Different network settings, i.e., 2D U-Net and 3D UNETR, are utilized to validate the generalization. We have extensively evaluated our approach using different diseases and modalities on BTCV, AMOS, MMWHS, and BRATS datasets. Experimental results in various settings demonstrate the effectiveness of our framework. This framework can serve as the foundation to facilitate future research on automatic annotation tasks under the supervision of medical reports.
comment: 14 pages, 7 figures
☆ Local map Construction Methods with SD map: A Novel Survey
In recent years, significant academic advancements have been made in the field of autonomous vehicles, with Local maps emerging as a crucial component of autonomous driving technology. Local maps not only provide intricate details of road networks but also serve as fundamental inputs for critical tasks such as vehicle localization, navigation, and decision-making. Given the characteristics of SD map (Standard Definition Map), which include low cost, ease of acquisition, and high versatility, perception methods that integrate SD map as prior information have demonstrated significant potential in the field of Local map perception. The purpose of this paper is to provide researchers with a comprehensive overview and summary of the latest advancements in the integration of SD map as prior information for Local map perception methods. This review begins by introducing the task definition and general pipeline of local map perception methods that incorporate SD maps as prior information, along with relevant public datasets. And then it focuses on the representation and encoding methods of multi-source information, as well as the methods for fusing multi-source information. In response to this burgeoning trend, this article presents a comprehensive and meticulous overview of the diverse research efforts in this particular field. Finally, the article addresses pertinent issues and future challenges with the aim of guiding researchers in understanding the current trends and methodologies prevalent in the field.
comment: 14 pages, 11 figures
☆ Hadamard Row-Wise Generation Algorithm
In this paper, we introduce an efficient algorithm for generating specific Hadamard rows, addressing the memory demands of pre-computing the entire matrix. Leveraging Sylvester's recursive construction, our method generates the required $i$-th row on demand, significantly reducing computational resources. The algorithm uses the Kronecker product to construct the desired row from the binary representation of the index, without creating the full matrix. This approach is particularly useful for single-pixel imaging systems that need only one row at a time.
☆ Neural Dynamics Model of Visual Decision-Making: Learning from Human Experts
Uncovering the fundamental neural correlates of biological intelligence, developing mathematical models, and conducting computational simulations are critical for advancing new paradigms in artificial intelligence (AI). In this study, we implemented a comprehensive visual decision-making model that spans from visual input to behavioral output, using a neural dynamics modeling approach. Drawing inspiration from the key components of the dorsal visual pathway in primates, our model not only aligns closely with human behavior but also reflects neural activities in primates, and achieving accuracy comparable to convolutional neural networks (CNNs). Moreover, magnetic resonance imaging (MRI) identified key neuroimaging features such as structural connections and functional connectivity that are associated with performance in perceptual decision-making tasks. A neuroimaging-informed fine-tuning approach was introduced and applied to the model, leading to performance improvements that paralleled the behavioral variations observed among subjects. Compared to classical deep learning models, our model more accurately replicates the behavioral performance of biological intelligence, relying on the structural characteristics of biological neural networks rather than extensive training data, and demonstrating enhanced resilience to perturbation.
☆ Multi-modal Situated Reasoning in 3D Scenes
Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, image, and point cloud for situation and question description, resolving ambiguity in previous single-modality convention (e.g., text). Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' situated reasoning for navigation. Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the efficacy of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models.
comment: Project page: https://msr3d.github.io/
☆ Unified Framework with Consistency across Modalities for Human Activity Recognition BMVC 2024
Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER ($\textbf{COMP}ositional h\textbf{U}man-cen\textbf{T}ric qu\textbf{ER}y$ machine), a generic neural architecture that models the interactions between a human of interest and its surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations for various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information from multimodal inputs for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance when compared with state-of-the-art methods. Our code is available at: https://github.com/tranxuantuyen/COMPUTER.
comment: Accepted to BMVC 2024
☆ GGS: Generalizable Gaussian Splatting for Lane Switching in Autonomous Driving
We propose GGS, a Generalizable Gaussian Splatting method for Autonomous Driving which can achieve realistic rendering under large viewpoint changes. Previous generalizable 3D gaussian splatting methods are limited to rendering novel views that are very close to the original pair of images, which cannot handle large differences in viewpoint. Especially in autonomous driving scenarios, images are typically collected from a single lane. The limited training perspective makes rendering images of a different lane very challenging. To further improve the rendering capability of GGS under large viewpoint changes, we introduces a novel virtual lane generation module into GSS method to enables high-quality lane switching even without a multi-lane dataset. Besides, we design a diffusion loss to supervise the generation of virtual lane image to further address the problem of lack of data in the virtual lanes. Finally, we also propose a depth refinement module to optimize depth estimation in the GSS model. Extensive validation of our method, compared to existing approaches, demonstrates state-of-the-art performance.
☆ Coral Model Generation from Single Images for Virtual Reality Applications
With the rapid development of VR technology, the demand for high-quality 3D models is increasing. Traditional methods struggle with efficiency and quality in large-scale customization. This paper introduces a deep-learning framework that generates high-precision 3D coral models from a single image. Using the Coral dataset, the framework extracts geometric and texture features, performs 3D reconstruction, and optimizes design and material blending. Advanced optimization and polygon count control ensure shape accuracy, detail retention, and flexible output for various complexities, catering to high-quality rendering and real-time interaction needs.The project incorporates Explainable AI (XAI) to transform AI-generated models into interactive "artworks," best viewed in VR and XR. This enhances model interpretability and human-machine collaboration. Real-time feedback in VR interactions displays information like coral species and habitat, enriching user experience. The generated models surpass traditional methods in detail, visual quality, and efficiency. This research offers an intelligent approach to 3D content creation for VR, lowering production barriers, and promoting widespread VR applications. Additionally, integrating XAI provides new insights into AI-generated visual content and advances research in 3D vision interpretability.
comment: In Proceedings of Explainable AI for the Arts Workshop 2024 (XAIxArts 2024) arXiv:2406.14485
☆ Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing
Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from intriguing observations: among a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness in the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identified editing directions with nice properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The codes will be released at https://github.com/ChicyChen/LOCO-Edit.
☆ Unfolding Videos Dynamics via Taylor Expansion
Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. The performances are enhanced by a significant margin without the need for large models or extensive datasets.
☆ Pluralistic Salient Object Detection
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image. Unlike conventional SOD methods that produce a single segmentation mask for salient objects, this new setting recognizes the inherent complexity of real-world images, comprising multiple objects, and the ambiguity in defining salient objects due to different user intentions. To study this task, we present two new SOD datasets "DUTS-MM" and "DUS-MQ", along with newly designed evaluation metrics. DUTS-MM builds upon the DUTS dataset but enriches the ground-truth mask annotations from three aspects which 1) improves the mask quality especially for boundary and fine-grained structures; 2) alleviates the annotation inconsistency issue; and 3) provides multiple ground-truth masks for images with saliency ambiguity. DUTS-MQ consists of approximately 100K image-mask pairs with human-annotated preference scores, enabling the learning of real human preferences in measuring mask quality. Building upon these two datasets, we propose a simple yet effective pluralistic SOD baseline based on a Mixture-of-Experts (MOE) design. Equipped with two prediction heads, it simultaneously predicts multiple masks using different query prompts and predicts human preference scores for each mask candidate. Extensive experiments and analyses underscore the significance of our proposed datasets and affirm the effectiveness of our PSOD framework.
♻ ☆ LADDER: Language Driven Slice Discovery and Error Rectification
Error slice discovery associates structured patterns with model errors. Existing methods discover error slices by clustering the error-prone samples with similar patterns or assigning discrete attributes to each sample for post-hoc analysis. While these methods aim for interpretability and easier mitigation through reweighting or rebalancing, they may not capture the full complexity of error patterns due to incomplete or missing attributes. Contrary to the existing approach, this paper utilizes the reasoning capabilities of the Large Language Model (LLM) to analyze complex error patterns and generate testable hypotheses. This paper proposes LADDER: Language Driven slice Discovery and Error Rectification. It first projects the model's representation into a language-aligned feature space (eg CLIP) to preserve semantics in the original model feature space. This ensures the accurate retrieval of sentences that highlight the model's errors. Next, the LLM utilizes the sentences and generates hypotheses to discover error slices. Finally, we mitigate the error by fine-tuning the classification head by creating a group-balanced dataset using the hypotheses. Our entire method does not require any attribute annotation, either explicitly or through external tagging models. We validate our method with \textbf{five} image classification datasets. The code is available (https://github.com/batmanlab/Ladder).
♻ ☆ Quantifying uncertainty in lung cancer segmentation with foundation models applied to mixed-domain datasets
Medical image foundation models have shown the ability to segment organs and tumors with minimal fine-tuning. These models are typically evaluated on task-specific in-distribution (ID) datasets. However, reliable performance on ID dataset does not guarantee robust generalization on out-of-distribution (OOD) datasets. Importantly, once deployed for clinical use, it is impractical to have `ground truth' delineations to assess ongoing performance drifts, especially when images fall into OOD category due to different imaging protocols. Hence, we introduced a comprehensive set of computationally fast metrics to evaluate the performance of multiple foundation models (Swin UNETR, SimMIM, iBOT, SMIT) trained with self-supervised learning (SSL). SSL pretraining was selected as this approach is applicable for large, diverse, and unlabeled image sets. All models were fine-tuned on identical datasets for lung tumor segmentation from computed tomography (CT) scans. SimMIM, iBOT, and SMIT used identical architecture, pretraining, and fine-tuning datasets to assess performance variations with the choice of pretext tasks used in SSL. Evaluation was performed on two public lung cancer datasets (LRAD: n = 140, 5Rater: n = 21) with different image acquisitions and tumor stage compared to training data (n = 317 public resource with stage III-IV lung cancers) and a public non-cancer dataset containing volumetric CT scans of patients with pulmonary embolism (n = 120). All models produced similarly accurate tumor segmentation on the lung cancer testing datasets. SMIT produced a highest F1-score (LRAD: 0.60, 5Rater: 0.64) and lowest entropy (LRAD: 0.06, 5Rater: 0.12), indicating higher tumor detection rate and confident segmentations. In the OOD dataset, SMIT misdetected least number of tumors, indicated by median volume occupancy of 5.67 cc compared to second best method SimMIM of 9.97 cc.
♻ ☆ Multi-task Learning Approach for Intracranial Hemorrhage Prognosis MICCAI 2024
Prognosis after intracranial hemorrhage (ICH) is influenced by a complex interplay between imaging and tabular data. Rapid and reliable prognosis are crucial for effective patient stratification and informed treatment decision-making. In this study, we aim to enhance image-based prognosis by learning a robust feature representation shared between prognosis and the clinical and demographic variables most highly correlated with it. Our approach mimics clinical decision-making by reinforcing the model to learn valuable prognostic data embedded in the image. We propose a 3D multi-task image model to predict prognosis, Glasgow Coma Scale and age, improving accuracy and interpretability. Our method outperforms current state-of-the-art baseline image models, and demonstrates superior performance in ICH prognosis compared to four board-certified neuroradiologists using only CT scans as input. We further validate our model with interpretability saliency maps. Code is available at https://github.com/MiriamCobo/MultitaskLearning_ICH_Prognosis.git.
comment: 16 pages. Accepted at Machine Learning in Medical Imaging Workshop @ MICCAI 2024 (MLMI2024). This is the submitted manuscript with added link to github repo, funding acknowledgements and authors' names and affiliations. No further post submission improvements or corrections were integrated. Final version not published yet
♻ ☆ SDE-based Multiplicative Noise Removal
Multiplicative noise, also known as speckle or pepper noise, commonly affects images produced by synthetic aperture radar (SAR), lasers, or optical lenses. Unlike additive noise, which typically arises from thermal processes or external factors, multiplicative noise is inherent to the system, originating from the fluctuation in diffuse reflections. These fluctuations result in multiple copies of the same signal with varying magnitudes being combined. Consequently, despeckling, or removing multiplicative noise, necessitates different techniques compared to those used for additive noise removal. In this paper, we propose a novel approach using Stochastic Differential Equations based diffusion models to address multiplicative noise. We demonstrate that multiplicative noise can be effectively modeled as a Geometric Brownian Motion process in the logarithmic domain. Utilizing the Fokker-Planck equation, we derive the corresponding reverse process for image denoising. To validate our method, we conduct extensive experiments on two different datasets, comparing our approach to both classical signal processing techniques and contemporary CNN-based noise removal models. Our results indicate that the proposed method significantly outperforms existing methods on perception-based metrics such as FID and LPIPS, while maintaining competitive performance on traditional metrics like PSNR and SSIM.
comment: 9 pages, 4 figures
♻ ☆ Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes ECCV 2024
It is challenging to reconstruct 3D point clouds in unseen classes from single 2D images. Instead of object-centered coordinate system, current methods generalized global priors learned in seen classes to reconstruct 3D shapes from unseen classes in viewer-centered coordinate system. However, the reconstruction accuracy and interpretability are still eager to get improved. To resolve this issue, we introduce to learn local pattern modularization for reconstructing 3D shapes in unseen classes, which achieves both good generalization ability and high reconstruction accuracy. Our insight is to learn a local prior which is class-agnostic and easy to generalize in object-centered coordinate system. Specifically, the local prior is learned via a process of learning and customizing local pattern modularization in seen classes. During this process, we first learn a set of patterns in local regions, which is the basis in the object-centered coordinate system to represent an arbitrary region on shapes across different classes. Then, we modularize each region on an initially reconstructed shape using the learned local patterns. Based on that, we customize the local pattern modularization using the input image by refining the reconstruction with more details. Our method enables to reconstruct high fidelity point clouds from unseen classes in object-centered coordinate system without requiring a large number of patterns or any additional information, such as segmentation supervision or camera poses. Our experimental results under widely used benchmarks show that our method achieves the state-of-the-art reconstruction accuracy for shapes from unseen classes. The code is available at https://github.com/chenchao15/Unseen.
comment: 14pages, 11figures, accepted by ECCV 2024
♻ ☆ CONDA: Condensed Deep Association Learning for Co-Salient Object Detection
Inter-image association modeling is crucial for co-salient object detection. Despite satisfactory performance, previous methods still have limitations on sufficient inter-image association modeling. Because most of them focus on image feature optimization under the guidance of heuristically calculated raw inter-image associations. They directly rely on raw associations which are not reliable in complex scenarios, and their image feature optimization approach is not explicit for inter-image association modeling. To alleviate these limitations, this paper proposes a deep association learning strategy that deploys deep networks on raw associations to explicitly transform them into deep association features. Specifically, we first create hyperassociations to collect dense pixel-pair-wise raw associations and then deploys deep aggregation networks on them. We design a progressive association generation module for this purpose with additional enhancement of the hyperassociation calculation. More importantly, we propose a correspondence-induced association condensation module that introduces a pretext task, i.e. semantic correspondence estimation, to condense the hyperassociations for computational burden reduction and noise elimination. We also design an object-aware cycle consistency loss for high-quality correspondence estimations. Experimental results in three benchmark datasets demonstrate the remarkable effectiveness of our proposed method with various training settings.
comment: There is an error. In Sec 4.1, the number of images in some dataset is incorrect and needs to be revised
♻ ☆ Open Gaze: Open Source eye tracker for smartphone devices using Deep Learning
Eye tracking has been a pivotal tool in diverse fields such as vision research, language analysis, and usability assessment. The majority of prior investigations, however, have concentrated on expansive desktop displays employing specialized, costly eye tracking hardware that lacks scalability. Remarkably little insight exists into ocular movement patterns on smartphones, despite their widespread adoption and significant usage. In this manuscript, we present an open-source implementation of a smartphone-based gaze tracker that emulates the methodology proposed by a GooglePaper (whose source code remains proprietary). Our focus is on attaining accuracy comparable to that attained through the GooglePaper's methodology, without the necessity for supplementary hardware. Through the integration of machine learning techniques, we unveil an accurate eye tracking solution that is native to smartphones. Our approach demonstrates precision akin to the state-of-the-art mobile eye trackers, which are characterized by a cost that is two orders of magnitude higher. Leveraging the vast MIT GazeCapture dataset, which is available through registration on the dataset's website, we successfully replicate crucial findings from previous studies concerning ocular motion behavior in oculomotor tasks and saliency analyses during natural image observation. Furthermore, we emphasize the applicability of smartphone-based gaze tracking in discerning reading comprehension challenges. Our findings exhibit the inherent potential to amplify eye movement research by significant proportions, accommodating participation from thousands of subjects with explicit consent. This scalability not only fosters advancements in vision research, but also extends its benefits to domains such as accessibility enhancement and healthcare applications.
comment: This paper results are incorrectly reported. The paper is not authentic and conclusions are not correct
♻ ☆ Q-Seg: Quantum Annealing-Based Unsupervised Image Segmentation
We present Q-Seg, a novel unsupervised image segmentation method based on quantum annealing, tailored for existing quantum hardware. We formulate the pixel-wise segmentation problem, which assimilates spectral and spatial information of the image, as a graph-cut optimization task. Our method efficiently leverages the interconnected qubit topology of the D-Wave Advantage device, offering superior scalability over existing quantum approaches and outperforming several tested state-of-the-art classical methods. Empirical evaluations on synthetic datasets have shown that Q-Seg has better runtime performance than the state-of-the-art classical optimizer Gurobi. The method has also been tested on earth observation image segmentation, a critical area with noisy and unreliable annotations. In the era of noisy intermediate-scale quantum, Q-Seg emerges as a reliable contender for real-world applications in comparison to advanced techniques like Segment Anything. Consequently, Q-Seg offers a promising solution using available quantum hardware, especially in situations constrained by limited labeled data and the need for efficient computational runtime.
comment: 12 pages, 9 figures, 1 table
♻ ☆ Enhancing the vision-language foundation model with key semantic knowledge-emphasized report refinement
Recently, vision-language representation learning has made remarkable advancements in building up medical foundation models, holding immense potential for transforming the landscape of clinical research and medical care. The underlying hypothesis is that the rich knowledge embedded in radiology reports can effectively assist and guide the learning process, reducing the need for additional labels. However, these reports tend to be complex and sometimes even consist of redundant descriptions that make the representation learning too challenging to capture the key semantic information. This paper develops a novel iterative vision-language representation learning framework by proposing a key semantic knowledge-emphasized report refinement method. Particularly, raw radiology reports are refined to highlight the key information according to a constructed clinical dictionary and two model-optimized knowledge-enhancement metrics. The iterative framework is designed to progressively learn, starting from gaining a general understanding of the patient's condition based on raw reports and gradually refines and extracts critical information essential to the fine-grained analysis tasks. The effectiveness of the proposed framework is validated on various downstream medical image analysis tasks, including disease classification, region-of-interest segmentation, and phrase grounding. Our framework surpasses seven state-of-the-art methods in both fine-tuning and zero-shot settings, demonstrating its encouraging potential for different clinical applications.
♻ ☆ Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension
In recent years, there has been interest in how geometric properties such as intrinsic dimension (ID) of a neural network's hidden representations change through its layers, and how such properties are predictive of important model behavior such as generalization ability. However, evidence has begun to emerge that such behavior can change significantly depending on the domain of the network's training data, such as natural versus medical images. Here, we further this inquiry by exploring how the ID of a network's learned representations changes through its layers, in essence, characterizing how the network successively refines the information content of input data to be used for predictions. Analyzing eleven natural and medical image datasets across six network architectures, we find that how ID changes through the network differs noticeably between natural and medical image models. Specifically, medical image models peak in representation ID earlier in the network, implying a difference in the image features and their abstractness that are typically used for downstream tasks in these domains. Additionally, we discover a strong correlation of this peak representation ID with the ID of the data in its input space, implying that the intrinsic information content of a model's learned representations is guided by that of the data it was trained on. Overall, our findings emphasize notable discrepancies in network behavior between natural and non-natural imaging domains regarding hidden representation information content, and provide further insights into how a network's learned features are shaped by its training data.
♻ ☆ CHOTA: A Higher Order Accuracy Metric for Cell Tracking
The evaluation of cell tracking results steers the development of tracking methods, significantly impacting biomedical research. This is quantitatively achieved by means of evaluation metrics. Unfortunately, current metrics favor local correctness and weakly reward global coherence, impeding high-level biological analysis. To also foster global coherence, we propose the CHOTA metric (Cell-specific Higher Order Tracking Accuracy) which unifies the evaluation of all relevant aspects of cell tracking: cell detections and local associations, global coherence, and lineage tracking. We achieve this by introducing a new definition of the term 'trajectory' that includes the entire cell lineage and by including this into the well-established HOTA metric from general multiple object tracking. Furthermore, we provide a detailed survey of contemporary cell tracking metrics to compare our novel CHOTA metric and to show its advantages. All metrics are extensively evaluated on state-of-the-art real-data cell tracking results and synthetic results that simulate specific tracking errors. We show that CHOTA is sensitive to all tracking errors and gives a good indication of the biologically relevant capability of a method to reconstruct the full lineage of cells. It introduces a robust and comprehensive alternative to the currently used metrics in cell tracking. Python code is available at https://github.com/CellTrackingChallenge/py-ctcmetrics .
comment: Accepted at BIC Workshop at European Conference on Computer Vision 2024, 14 pages, 4 figures, 2 tables
♻ ☆ Large Scale Unsupervised Brain MRI Image Registration Solution for Learn2Reg 2024 MICCAI
In this paper, we summarize the methods and experimental results we proposed for Task 2 in the learn2reg 2024 Challenge. This task focuses on unsupervised registration of anatomical structures in brain MRI images between different patients. The difficulty lies in: (1) without segmentation labels, and (2) a large amount of data. To address these challenges, we built an efficient backbone network and explored several schemes to further enhance registration accuracy. Under the guidance of the NCC loss function and smoothness regularization loss function, we obtained a smooth and reasonable deformation field. According to the leaderboard, our method achieved a Dice coefficient of 77.34%, which is 1.4% higher than the TransMorph. Overall, we won second place on the leaderboard for Task 2.
comment: MICCAI Learn2Reg 2024 Challenge & WBIR 2024 Workshop on Biomedical Imaging Registration
♻ ☆ Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation
In this paper, we address the challenge of compressing generative adversarial networks (GANs) for deployment in resource-constrained environments by proposing two novel methodologies: Distribution Matching for Efficient compression (DiME) and Network Interactive Compression via Knowledge Exchange and Learning (NICKEL). DiME employs foundation models as embedding kernels for efficient distribution matching, leveraging maximum mean discrepancy to facilitate effective knowledge distillation. Simultaneously, NICKEL employs an interactive compression method that enhances the communication between the student generator and discriminator, achieving a balanced and stable compression process. Our comprehensive evaluation on the StyleGAN2 architecture with the FFHQ dataset shows the effectiveness of our approach, with NICKEL & DiME achieving FID scores of 10.45 and 15.93 at compression rates of 95.73% and 98.92%, respectively. Remarkably, our methods sustain generative quality even at an extreme compression rate of 99.69%, surpassing the previous state-of-the-art performance by a large margin. These findings not only demonstrate our methodologies' capacity to significantly lower GANs' computational demands but also pave the way for deploying high-quality GAN models in settings with limited resources. Our code will be released soon.
♻ ☆ When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective
Adapting pre-trained models to new tasks can exhibit varying effectiveness across datasets. Visual prompting, a state-of-the-art parameter-efficient transfer learning method, can significantly improve the performance of out-of-distribution tasks. On the other hand, linear probing, a standard transfer learning method, can sometimes become the best approach. We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. By employing the LLR score alongside resource-efficient visual prompts approximations, our cost-effective measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%. The source code is available at https://github.com/IBM/VP-LLR.
♻ ☆ MMA-MRNNet: Harnessing Multiple Models of Affect and Dynamic Masked RNN for Precise Facial Expression Intensity Estimation
This paper presents MMA-MRNNet, a novel deep learning architecture for dynamic multi-output Facial Expression Intensity Estimation (FEIE) from video data. Traditional approaches to this task often rely on complex 3-D CNNs, which require extensive pre-training and assume that facial expressions are uniformly distributed across all frames of a video. These methods struggle to handle videos of varying lengths, often resorting to ad-hoc strategies that either discard valuable information or introduce bias. MMA-MRNNet addresses these challenges through a two-stage process. First, the Multiple Models of Affect (MMA) extractor component is a Multi-Task Learning CNN that concurrently estimates valence-arousal, recognizes basic facial expressions, and detects action units in each frame. These representations are then processed by a Masked RNN component, which captures temporal dependencies and dynamically updates weights according to the true length of the input video, ensuring that only the most relevant features are used for the final prediction. The proposed unimodal non-ensemble learning MMA-MRNNet was evaluated on the Hume-Reaction dataset and demonstrated significantly superior performance, surpassing state-of-the-art methods by a wide margin, regardless of whether they were unimodal, multimodal, or ensemble approaches. Finally, we demonstrated the effectiveness of the MMA component of our proposed method across multiple in-the-wild datasets, where it consistently outperformed all state-of-the-art methods across various metrics.
♻ ☆ In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data
Studying and analyzing cropland is a difficult task due to its dynamic and heterogeneous growth behavior. Usually, diverse data sources can be collected for its estimation. Although deep learning models have proven to excel in the crop classification task, they face substantial challenges when dealing with multiple inputs, named Multi-View Learning (MVL). The methods used in the MVL scenario can be structured based on the encoder architecture, the fusion strategy, and the optimization technique. The literature has primarily focused on using specific encoder architectures for local regions, lacking a deeper exploration of other components in the MVL methodology. In contrast, we investigate the simultaneous selection of the fusion strategy and encoder architecture, assessing global-scale cropland and crop-type classifications. We use a range of five fusion strategies (Input, Feature, Decision, Ensemble, Hybrid) and five temporal encoders (LSTM, GRU, TempCNN, TAE, L-TAE) as possible configurations in the MVL method. We use the CropHarvest dataset for validation, which provides optical, radar, weather time series, and topographic information as input data. We found that in scenarios with a limited number of labeled samples, a unique configuration is insufficient for all the cases. Instead, a specialized combination should be meticulously sought, including an encoder and fusion strategy. To streamline this search process, we suggest identifying the optimal encoder architecture tailored for a particular fusion strategy, and then determining the most suitable fusion strategy for the classification task. We provide a methodological framework for researchers exploring crop classification through an MVL methodology.
comment: submitted to journal
♻ ☆ Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation ACL
Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
comment: Accepted at the MACLEAN workshop in the ECML/PKDD 2024
♻ ☆ Scalable Glacier Mapping using Deep Learning and Open Earth Observation Data Matches the Accuracy of Manual Delineation
Accurate global glacier mapping is critical for understanding climate change impacts. Despite its importance, automated glacier mapping at a global scale remains largely unexplored. Here we address this gap and propose Glacier-VisionTransformer-U-Net (GlaViTU), a convolutional-transformer deep learning model, and five strategies for multitemporal global-scale glacier mapping using open satellite imagery. Assessing the spatial, temporal and cross-sensor generalisation shows that our best strategy achieves intersection over union >0.85 on previously unobserved images in most cases, which drops to >0.75 for debris-rich areas such as High-Mountain Asia and increases to >0.90 for regions dominated by clean ice. A comparative validation against human expert uncertainties in terms of area and distance deviations underscores GlaViTU performance, approaching or matching expert-level delineation. Adding synthetic aperture radar data, namely, backscatter and interferometric coherence, increases the accuracy in all regions where available. The calibrated confidence for glacier extents is reported making the predictions more reliable and interpretable. We also release a benchmark dataset that covers 9% of glaciers worldwide. Our results support efforts towards automated multitemporal and global glacier mapping.
comment: after major revision, expanded validation
♻ ☆ CSGO: Content-Style Composition in Text-to-Image Generation
The diffusion model has shown exceptional capabilities in controlled image generation, which has further fueled interest in image style transfer. Existing works mainly focus on training free-based methods (e.g., image inversion) due to the scarcity of specific data. In this study, we present a data construction pipeline for content-style-stylized image triplets that generates and automatically cleanses stylized data triplets. Based on this pipeline, we construct a dataset IMAGStyle, the first large-scale style transfer dataset containing 210k image triplets, available for the community to explore and research. Equipped with IMAGStyle, we propose CSGO, a style transfer model based on end-to-end training, which explicitly decouples content and style features employing independent feature injection. The unified CSGO implements image-driven style transfer, text-driven stylized synthesis, and text editing-driven stylized synthesis. Extensive experiments demonstrate the effectiveness of our approach in enhancing style control capabilities in image generation. Additional visualization and access to the source code can be located on the project page: \url{https://csgo-gen.github.io/}.
♻ ☆ Object-Size-Driven Design of Convolutional Neural Networks: Virtual Axle Detection based on Raw Data
As infrastructure ages, the need for efficient monitoring methods becomes increasingly critical. Bridge Weigh-In-Motion (BWIM) systems are crucial for cost-efficient load and thus residual service life determination of road and railway infrastructure. However, conventional BWIM systems require additional sensors for axle detection, which have to be installed in potentially inaccessible locations or in locations that interfere with bridge operation. This study addresses this challenge by replacing dedicated axle detectors with a novel approach to real-time detection of train axles using sensors arbitrarily placed on bridges. The proposed Virtual Axle Detector with Enhanced Receptive Field (VADER) has been validated on a single-track railway bridge, demonstrating that it achieves to detect 99.9% of axles with a spatial error of 3.69cm using only acceleration measurements. Using raw data as input outperforms the state-of-the-art spectrogram-based method in both speed and memory usage by 99%, making real-time application feasible for the first time. Additionally, we introduce the Maximum Receptive Field (MRF) rule, a novel approach to optimise hyperparameters of Convolutional Neural Networks (CNNs) based on the size of objects, which in this case relates to the fundamental frequency of a bridge. The MRF rule effectively narrows the hyperparameter search space, potentially replacing the need for extensive hyperparameter tuning. Since the MRF rule is theoretically applicable to all unstructured data, it could have implications for a wide range of deep learning problems from earthquake prediction to object recognition.
♻ ☆ Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data
The increasing availability of image-text pairs has largely fueled the rapid advancement in vision-language foundation models. However, the vast scale of these datasets inevitably introduces significant variability in data quality, which can adversely affect the model performance. This highlights the critical role of data filtering, not only to enhance training efficiency but also to improve overall data quality. Existing methods typically rely on metrics such as CLIP Score and BLIP Score, which are derived from pre-trained models. However, these models are often trained on uncurated, noisy datasets, which can perpetuate errors and misalignments in the filtered dataset. We present a novel algorithm that incorporates human knowledge on image-text alignment to guide filtering vast corpus of web-crawled image-text datasets into a compact and high-quality form. To systemically capture human preferences on image-text alignments, we collect a diverse image-text dataset where each image is associated with multiple captions from various sources, and establish a comprehensive set of both subjective and objective criteria for critically guiding the alignment assessment from labelers. Additionally, we train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment. The resulting reward model thus can act as a human-like referee to filter image-text pairs. Extensive experiments demonstrate that we can maintain, sometimes even improve, model performance while compressing the image-text datasets up to ~90%. An impressive example is that, by aggressively reducing the total training sample from 130M to only 15.5M, our BLIP-B/16 models consistently show an average improvement of 2.9% on retrieval tasks and 11.5% on captioning tasks compared to full-size-dataset counterparts.
♻ ☆ Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition CVPR 2024
Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multitask multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: First, a multi-modal contrastive loss, that pulls diverse data modalities of the same video together in the representation space. Second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space. Finally, a multi-modal data reconstruction loss. We conduct a comprehensive study on this multimodal multi-task self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotations required. We release our pre-trained models as well as source code publicly
comment: The paper will appear in the CVPR 2024 workshops proceedings
♻ ☆ UHD-IQA Benchmark Database: Pushing the Boundaries of Blind Photo Quality Assessment
We introduce a novel Image Quality Assessment (IQA) dataset comprising 6073 UHD-1 (4K) images, annotated at a fixed width of 3840 pixels. Contrary to existing No-Reference (NR) IQA datasets, ours focuses on highly aesthetic photos of high technical quality, filling a gap in the literature. The images, carefully curated to exclude synthetic content, are sufficiently diverse to train general NR-IQA models. Importantly, the dataset is annotated with perceptual quality ratings obtained through a crowdsourcing study. Ten expert raters, comprising photographers and graphics artists, assessed each image at least twice in multiple sessions spanning several days, resulting in 20 highly reliable ratings per image. Annotators were rigorously selected based on several metrics, including self-consistency, to ensure their reliability. The dataset includes rich metadata with user and machine-generated tags from over 5,000 categories and popularity indicators such as favorites, likes, downloads, and views. With its unique characteristics, such as its focus on high-quality images, reliable crowdsourced annotations, and high annotation resolution, our dataset opens up new opportunities for advancing perceptual image quality assessment research and developing practical NR-IQA models that apply to modern photos. Our dataset is available at https://database.mmsp-kn.de/uhd-iqa-benchmark-database.html
♻ ☆ Model-agnostic explainable artificial intelligence for object detection in image data
In recent years, deep neural networks have been widely used for building high-performance Artificial Intelligence (AI) systems for computer vision applications. Object detection is a fundamental task in computer vision, which has been greatly progressed through developing large and intricate AI models. However, the lack of transparency is a big challenge that may not allow the widespread adoption of these models. Explainable artificial intelligence is a field of research where methods are developed to help users understand the behavior, decision logics, and vulnerabilities of AI systems. Previously, few explanation methods were developed for object detection based on random masking. However, random masks may raise some issues regarding the actual importance of pixels within an image. In this paper, we design and implement a black-box explanation method named Black-box Object Detection Explanation by Masking (BODEM) through adopting a hierarchical random masking approach for object detection systems. We propose a hierarchical random masking framework in which coarse-grained masks are used in lower levels to find salient regions within an image, and fine-grained mask are used to refine the salient regions in higher levels. Experimentations on various object detection datasets and models showed that BODEM can effectively explain the behavior of object detectors. Moreover, our method outperformed Detector Randomized Input Sampling for Explanation (D-RISE) and Local Interpretable Model-agnostic Explanations (LIME) with respect to different quantitative measures of explanation effectiveness. The experimental results demonstrate that BODEM can be an effective method for explaining and validating object detection systems in black-box testing scenarios.
♻ ☆ Map-Free Visual Relocalization Enhanced by Instance Knowledge and Depth Knowledge
Map-free relocalization technology is crucial for applications in autonomous navigation and augmented reality, but relying on pre-built maps is often impractical. It faces significant challenges due to limitations in matching methods and the inherent lack of scale in monocular images. These issues lead to substantial rotational and metric errors and even localization failures in real-world scenarios. Large matching errors significantly impact the overall relocalization process, affecting both rotational and translational accuracy. Due to the inherent limitations of the camera itself, recovering the metric scale from a single image is crucial, as this significantly impacts the translation error. To address these challenges, we propose a map-free relocalization method enhanced by instance knowledge and depth knowledge. By leveraging instance-based matching information to improve global matching results, our method significantly reduces the possibility of mismatching across different objects. The robustness of instance knowledge across the scene helps the feature point matching model focus on relevant regions and enhance matching accuracy. Additionally, we use estimated metric depth from a single image to reduce metric errors and improve scale recovery accuracy. By integrating methods dedicated to mitigating large translational and rotational errors, our approach demonstrates superior performance in map-free relocalization techniques.
comment: 17 pages,6 figures
♻ ☆ CT-AGRG: Automated Abnormality-Guided Report Generation from 3D Chest CT Volumes
The rapid increase of computed tomography (CT) scans and their time-consuming manual analysis have created an urgent need for robust automated analysis techniques in clinical settings. These aim to assist radiologists and help them managing their growing workload. Existing methods typically generate entire reports directly from 3D CT images, without explicitly focusing on observed abnormalities. This unguided approach often results in repetitive content or incomplete reports, failing to prioritize anomaly-specific descriptions. We propose a new anomaly-guided report generation model, which first predicts abnormalities and then generates targeted descriptions for each. Evaluation on a public dataset demonstrates significant improvements in report quality and clinical relevance. We extend our work by conducting an ablation study to demonstrate its effectiveness.
comment: 15 pages, 9 figures, submitted to ISBI 2025
♻ ☆ Path-SAM2: Transfer SAM2 for digital pathology semantic segmentation
The semantic segmentation task in pathology plays an indispensable role in assisting physicians in determining the condition of tissue lesions. With the proposal of Segment Anything Model (SAM), more and more foundation models have seen rapid development in the field of image segmentation. Recently, SAM2 has garnered widespread attention in both natural image and medical image segmentation. Compared to SAM, it has significantly improved in terms of segmentation accuracy and generalization performance. We compared the foundational models based on SAM and found that their performance in semantic segmentation of pathological images was hardly satisfactory. In this paper, we propose Path-SAM2, which for the first time adapts the SAM2 model to cater to the task of pathological semantic segmentation. We integrate the largest pretrained vision encoder for histopathology (UNI) with the original SAM2 encoder, adding more pathology-based prior knowledge. Additionally, we introduce a learnable Kolmogorov-Arnold Networks (KAN) classification module to replace the manual prompt process. In three adenoma pathological datasets, Path-SAM2 has achieved state-of-the-art performance.This study demonstrates the great potential of adapting SAM2 to pathology image segmentation tasks. We plan to release the code and model weights for this paper at: https://github.com/simzhangbest/SAM2PATH
comment: 5 pages , 5 figures
♻ ☆ Bayesian Evidential Learning for Few-Shot Classification
Few-Shot Classification(FSC) aims to generalize from base classes to novel classes given very limited labeled samples, which is an important step on the path toward human-like machine learning. State-of-the-art solutions involve learning to find a good metric and representation space to compute the distance between samples. Despite the promising accuracy performance, how to model uncertainty for metric-based FSC methods effectively is still a challenge. To model uncertainty, We place a distribution over class probability based on the theory of evidence. As a result, uncertainty modeling and metric learning can be decoupled. To reduce the uncertainty of classification, we propose a Bayesian evidence fusion theorem. Given observed samples, the network learns to get posterior distribution parameters given the prior parameters produced by the pre-trained network. Detailed gradient analysis shows that our method provides a smooth optimization target and can capture the uncertainty. The proposed method is agnostic to metric learning strategies and can be implemented as a plug-and-play module. We integrate our method into several newest FSC methods and demonstrate the improved accuracy and uncertainty quantification on standard FSC benchmarks.
comment: 15 pages
♻ ☆ AI-Assisted Cervical Cancer Screening
Visual Inspection with Acetic Acid (VIA) remains the most feasible cervical cancer screening test in resource-constrained settings of low- and middle-income countries (LMICs), which are often performed screening camps or primary/community health centers by nurses instead of the preferred but unavailable expert Gynecologist. To address the highly subjective nature of the test, various handheld devices integrating cameras or smartphones have been recently explored to capture cervical images during VIA and aid decision-making via telemedicine or AI models. Most studies proposing AI models retrospectively use a relatively small number of already collected images from specific devices, digital cameras, or smartphones; the challenges and protocol for quality image acquisition during VIA in resource-constrained camp settings, challenges in getting gold standard, data imbalance, etc. are often overlooked. We present a novel approach and describe the end-to-end design process to build a robust smartphone-based AI-assisted system that does not require buying a separate integrated device: the proposed protocol for quality image acquisition in resource-constrained settings, dataset collected from 1,430 women during VIA performed by nurses in screening camps, preprocessing pipeline, and training and evaluation of a deep-learning-based classification model aimed to identify (pre)cancerous lesions. Our work shows that the readily available smartphones and a suitable protocol can capture the cervix images with the required details for the VIA test well; the deep-learning-based classification model provides promising results to assist nurses in VIA screening; and provides a direction for large-scale data collection and validation in resource-constrained settings.
♻ ☆ Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images
We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality. Result videos are also available on our project page: https://haruolabs.github.io/style-n2n/
comment: 16 pages, 9 figures
♻ ☆ Rethinking Barely-Supervised Volumetric Medical Image Segmentation from an Unsupervised Domain Adaptation Perspective
This paper investigates an extremely challenging problem: barely-supervised volumetric medical image segmentation (BSS). A BSS training dataset consists of two parts: 1) a barely-annotated labeled set, where each labeled image contains only a single-slice annotation, and 2) an unlabeled set comprising numerous unlabeled volumetric images. State-of-the-art BSS methods employ a registration-based paradigm, which uses inter-slice image registration to propagate single-slice annotations into volumetric pseudo labels, constructing a completely annotated labeled set, to which a semi-supervised segmentation scheme can be applied. However, the paradigm has a critical limitation: the pseudo-labels generated by image registration are unreliable and noisy. Motivated by this, we propose a new perspective: instead of solving BSS within a semi-supervised learning scheme, this work formulates BSS as an unsupervised domain adaptation problem. To this end, we propose a novel BSS framework, \textbf{B}arely-supervised learning \textbf{via} unsupervised domain \textbf{A}daptation (BvA), as an alternative to the dominant registration paradigm. Specifically, we first design a novel noise-free labeled data construction algorithm (NFC) for slice-to-volume labeled data synthesis. Then, we introduce a frequency and spatial Mix-Up strategy (FSX) to mitigate the domain shifts. Extensive experiments demonstrate that our method provides a promising alternative for BSS. Remarkably, the proposed method, trained on the left atrial segmentation dataset with \textbf{only one} barely-labeled image, achieves a Dice score of 81.20%, outperforming the state-of-the-art by 61.71%. The code is available at https://github.com/Senyh/BvA.
♻ ☆ Zero-shot 3D Segmentation of Abdominal Organs in CT Scans Using Segment Anything Model 2: Adapting Video Tracking Capabilities for 3D Medical Imaging
Purpose: To evaluate the zero-shot performance of Segment Anything Model 2 (SAM 2) in 3D segmentation of abdominal organs in CT scans, and to investigate the effects of prompt settings on segmentation results. Materials and Methods: Using a subset of the TotalSegmentator CT dataset (n = 123) from eight institutions, we assessed SAM 2's ability to segment eight abdominal organs. Segmentation was initiated from three different z-coordinate levels (caudal, mid, and cranial levels) of each organ. Performance was measured using the Dice similarity coefficient (DSC). We also analyzed the impact of "negative prompts," which explicitly exclude certain regions from the segmentation process, on accuracy. Additionally, we analyzed organ volumes to contextualize the segmentation performance. Results: As a zero-shot approach, larger organs with clear boundaries demonstrated high segmentation performance, with mean(median) DSCs as follows: liver 0.821(0.898), left kidney 0.870(0.921), right kidney 0.862(0.935), and spleen 0.891(0.932). Smaller organs showed lower performance: gallbladder 0.531(0.590), pancreas 0.361(0.359), and adrenal glands, right 0.203(0.109), left 0.308(0.231). The initial slice for segmentation and the use of negative prompts significantly influenced the results. By removing negative prompts from the input, the DSCs significantly decreased for six organs. Moderate positive correlations were observed between volume sizes and DSCs. Conclusion: SAM 2 demonstrated promising zero-shot performance in segmenting certain abdominal organs in CT scans, particularly larger organs with clear boundaries. Performance was significantly influenced by input negative prompts and initial slice selection, highlighting the importance of optimizing these factors for effective segmentation.
comment: 20 pages, 7 figures (including 2 supplemental figure), 4 tables
♻ ☆ RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content ECCV 2024
With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artifacts, and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimized through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
comment: This paper has been accepted by the ECCV 2024 AIM Advances in Image Manipulation workshop
♻ ☆ Group-aware Parameter-efficient Updating for Content-Adaptive Neural Video Compression ACM MM 2024
Content-adaptive compression is crucial for enhancing the adaptability of the pre-trained neural codec for various contents. Although these methods have been very practical in neural image compression (NIC), their application in neural video compression (NVC) is still limited due to two main aspects: 1), video compression relies heavily on temporal redundancy, therefore updating just one or a few frames can lead to significant errors accumulating over time; 2), NVC frameworks are generally more complex, with many large components that are not easy to update quickly during encoding. To address the previously mentioned challenges, we have developed a content-adaptive NVC technique called Group-aware Parameter-Efficient Updating (GPU). Initially, to minimize error accumulation, we adopt a group-aware approach for updating encoder parameters. This involves adopting a patch-based Group of Pictures (GoP) training strategy to segment a video into patch-based GoPs, which will be updated to facilitate a globally optimized domain-transferable solution. Subsequently, we introduce a parameter-efficient delta-tuning strategy, which is achieved by integrating several light-weight adapters into each coding component of the encoding process by both serial and parallel configuration. Such architecture-agnostic modules stimulate the components with large parameters, thereby reducing both the update cost and the encoding time. We incorporate our GPU into the latest NVC framework and conduct comprehensive experiments, whose results showcase outstanding video compression efficiency across four video benchmarks and adaptability of one medical image benchmark.
comment: Accepted by ACM MM 2024, Melbourne, Australia
♻ ☆ CrossDF: Improving Cross-Domain Deepfake Detection with Deep Information Decomposition
Deepfake technology poses a significant threat to security and social trust. Although existing detection methods have shown high performance in identifying forgeries within datasets that use the same deepfake techniques for both training and testing, they suffer from sharp performance degradation when faced with cross-dataset scenarios where unseen deepfake techniques are tested. To address this challenge, we propose a Deep Information Decomposition (DID) framework to enhance the performance of Cross-dataset Deepfake Detection (CrossDF). Unlike most existing deepfake detection methods, our framework prioritizes high-level semantic features over specific visual artifacts. Specifically, it adaptively decomposes facial features into deepfake-related and irrelevant information, only using the intrinsic deepfake-related information for real/fake discrimination. Moreover, it optimizes these two kinds of information to be independent with a de-correlation learning module, thereby enhancing the model's robustness against various irrelevant information changes and generalization ability to unseen forgery methods. Our extensive experimental evaluation and comparison with existing state-of-the-art detection methods validate the effectiveness and superiority of the DID framework on cross-dataset deepfake detection.
♻ ☆ Towards Extreme Image Compression with Latent Feature Guidance and Diffusion Prior
Image compression at extremely low bitrates (below 0.1 bits per pixel (bpp)) is a significant challenge due to substantial information loss. In this work, we propose a novel two-stage extreme image compression framework that exploits the powerful generative capability of pre-trained diffusion models to achieve realistic image reconstruction at extremely low bitrates. In the first stage, we treat the latent representation of images in the diffusion space as guidance, employing a VAE-based compression approach to compress images and initially decode the compressed information into content variables. The second stage leverages pre-trained stable diffusion to reconstruct images under the guidance of content variables. Specifically, we introduce a small control module to inject content information while keeping the stable diffusion model fixed to maintain its generative capability. Furthermore, we design a space alignment loss to force the content variables to align with the diffusion space and provide the necessary constraints for optimization. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in terms of visual performance at extremely low bitrates. The source code and trained models are available at https://github.com/huai-chang/DiffEIC.
comment: Accepted by IEEE TCSVT
♻ ☆ Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration
Multimodal learning leverages complementary information derived from different modalities, thereby enhancing performance in medical image segmentation. However, prevailing multimodal learning methods heavily rely on extensive well-annotated data from various modalities to achieve accurate segmentation performance. This dependence often poses a challenge in clinical settings due to limited availability of such data. Moreover, the inherent anatomical misalignment between different imaging modalities further complicates the endeavor to enhance segmentation performance. To address this problem, we propose a novel semi-supervised multimodal segmentation framework that is robust to scarce labeled data and misaligned modalities. Our framework employs a novel cross modality collaboration strategy to distill modality-independent knowledge, which is inherently associated with each modality, and integrates this information into a unified fusion layer for feature amalgamation. With a channel-wise semantic consistency loss, our framework ensures alignment of modality-independent information from a feature-wise perspective across modalities, thereby fortifying it against misalignments in multimodal scenarios. Furthermore, our framework effectively integrates contrastive consistent learning to regulate anatomical structures, facilitating anatomical-wise prediction alignment on unlabeled data in semi-supervised segmentation tasks. Our method achieves competitive performance compared to other multimodal methods across three tasks: cardiac, abdominal multi-organ, and thyroid-associated orbitopathy segmentations. It also demonstrates outstanding robustness in scenarios involving scarce labeled data and misaligned modalities.
♻ ☆ Weakly Supervised Intracranial Hemorrhage Segmentation with YOLO and an Uncertainty Rectified Segment Anything Model
Intracranial hemorrhage (ICH) is a life-threatening condition that requires rapid and accurate diagnosis to improve treatment outcomes and patient survival rates. Recent advancements in supervised deep learning have greatly improved the analysis of medical images, but often rely on extensive datasets with high-quality annotations, which are costly, time-consuming, and require medical expertise to prepare. To mitigate the need for large amounts of expert-prepared segmentation data, we have developed a novel weakly supervised ICH segmentation method that utilizes the YOLO object detection model and an uncertainty-rectified Segment Anything Model (SAM). In addition, we have proposed a novel point prompt generator for this model to further improve segmentation results with YOLO-predicted bounding box prompts. Our approach achieved a high accuracy of 0.933 and an AUC of 0.796 in ICH detection, along with a mean Dice score of 0.629 for ICH segmentation, outperforming existing weakly supervised and popular supervised (UNet and Swin-UNETR) approaches. Overall, the proposed method provides a robust and accurate alternative to the more commonly used supervised techniques for ICH quantification without requiring refined segmentation ground truths during model training.
comment: Manuscript was accepted at SWITCH2024. 10 pages, 2 figures
♻ ☆ Learned Image Transmission with Hierarchical Variational Autoencoder
In this paper, we introduce an innovative hierarchical joint source-channel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, our proposed HJSCC provides enhanced adaptability by dynamically adjusting transmission bandwidth, encoding these representations into varying amounts of channel symbols. Additionally, we introduce a rate attention module to guide the JSCC encoder in optimizing its encoding strategy based on prior information. Extensive experiments on images of varying resolutions demonstrate that our proposed model outperforms existing baselines in rate-distortion performance and maintains robustness against channel noise.
♻ ☆ A Novel Approach to Classify Power Quality Signals Using Vision Transformers
With the rapid integration of electronically interfaced renewable energy resources and loads into smart grids, there is increasing interest in power quality disturbances (PQD) classification to enhance the security and efficiency of these grids. This paper introduces a new approach to PQD classification based on the Vision Transformer (ViT) model. When a PQD occurs, the proposed approach first converts the power quality signal into an image and then utilizes a pre-trained ViT to accurately determine the class of the PQD. Unlike most previous works, which were limited to a few disturbance classes or small datasets, the proposed method is trained and tested on a large dataset with 17 disturbance classes. Our experimental results show that the proposed ViT-based approach achieves PQD classification precision and recall of 98.28% and 97.98%, respectively, outperforming recently proposed techniques applied to the same dataset.
comment: IECON 2024-50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, U.S.A, 2024, pp. 1-6
♻ ☆ Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.
comment: Project page https://haozhuo-zhang.github.io/Hand1000-project-page/
♻ ☆ MV-VTON: Multi-View Virtual Try-On with Diffusion Models
The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models that have demonstrated superior abilities to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest joint attention blocks to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets. Codes and datasets are publicly released at https://github.com/hywang2002/MV-VTON .
comment: Project url: https://hywang2002.github.io/MV-VTON/
♻ ☆ Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning ECCV 2024
Federated Class Continual Learning (FCCL) merges the challenges of distributed client learning with the need for seamless adaptation to new classes without forgetting old ones. The key challenge in FCCL is catastrophic forgetting, an issue that has been explored to some extent in Continual Learning (CL). However, due to privacy preservation requirements, some conventional methods, such as experience replay, are not directly applicable to FCCL. Existing FCCL methods mitigate forgetting by generating historical data through federated training of GANs or data-free knowledge distillation. However, these approaches often suffer from unstable training of generators or low-quality generated data, limiting their guidance for the model. To address this challenge, we propose a novel method of data replay based on diffusion models. Instead of training a diffusion model, we employ a pre-trained conditional diffusion model to reverse-engineer each class, searching the corresponding input conditions for each class within the model's input space, significantly reducing computational resources and time consumption while ensuring effective generation. Furthermore, we enhance the classifier's domain generalization ability on generated and real data through contrastive learning, indirectly improving the representational capability of generated data for real data. Comprehensive experiments demonstrate that our method significantly outperforms existing baselines. Code is available at https://github.com/jinglin-liang/DDDR.
comment: Accepted by ECCV 2024 Oral
♻ ☆ ORMNet: Object-centric Relationship Modeling for Egocentric Hand-object Segmentation
Egocentric hand-object segmentation (EgoHOS) is a promising new task aiming at segmenting hands and interacting objects in egocentric images. Although EgoHOS has the potential to enable various applications, current methods struggle to achieve both high performance and end-to-end optimization simultaneously. Moreover, existing approaches fail to fully leverage hand cues to assist the interacting-object segmentation and overlook the coupled relationships between diverse interacting-object categories, resulting in performance deficiencies. To address these limitations, this paper proposes a novel Object-centric Relationship Modeling Network (ORMNet) to fulfill end-to-end and effective EgoHOS by modeling relationships between hands and objects as well as objects and objects. Specifically, a Hand-Object Relation (HOR) module is introduced to capture the correlation between hands and objects, which uses hand features to guide the network to extract more distinguishing interacting-object features. Besides, we find the coupling relations between diverse interacting-object categories and design the Object Relation Decoupling (ORD) strategy to disentangle them, emphasizing learning of the interaction between hands and objects and reducing the confusion of interacting-object classification. In-domain experiments show that ORMNet has notably exceptional segmentation performance compared with state-of-the-art methods, while out-of-domain experiments further exhibit its robust generalization capability. The project is available at https://github.com/yuggiehk/ORMNet/
♻ ☆ MADE-for-ASD: A Multi-Atlas Deep Ensemble Network for Diagnosing Autism Spectrum Disorder
In response to the global need for efficient early diagnosis of Autism Spectrum Disorder (ASD), this paper bridges the gap between traditional, time-consuming diagnostic methods and potential automated solutions. We propose a multi-atlas deep ensemble network, MADE-for-ASD, that integrates multiple atlases of the brain's functional magnetic resonance imaging (fMRI) data through a weighted deep ensemble network. Our approach integrates demographic information into the prediction workflow, which enhances ASD diagnosis performance and offers a more holistic perspective on patient profiling. We experiment with the well-known publicly available ABIDE (Autism Brain Imaging Data Exchange) I dataset, consisting of resting state fMRI data from 17 different laboratories around the globe. Our proposed system achieves 75.20% accuracy on the entire dataset and 96.40% on a specific subset $-$ both surpassing reported ASD diagnosis accuracy in ABIDE I fMRI studies. Specifically, our model improves by 4.4 percentage points over prior works on the same amount of data. The model exhibits a sensitivity of 82.90% and a specificity of 69.70% on the entire dataset, and 91.00% and 99.50%, respectively, on the specific subset. We leverage the F-score to pinpoint the top 10 ROI in ASD diagnosis, such as precuneus and anterior cingulate/ventromedial. The proposed system can potentially pave the way for more cost-effective, efficient and scalable strategies in ASD diagnosis. Codes and evaluations are publicly available at https://github.com/hasan-rakibul/MADE-for-ASD.
comment: Xuehan Liu and Md Rakibul Hasan contributed equally to this work
♻ ☆ MCDubber: Multimodal Context-Aware Expressive Video Dubbing SC2024
Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed \textbf{MCDubber}, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
comment: Accepted by NCMMSC2024
♻ ☆ Asynchronous Blob Tracker for Event Cameras
Event-based cameras are popular for tracking fast-moving objects due to their high temporal resolution, low latency, and high dynamic range. In this paper, we propose a novel algorithm for tracking event blobs using raw events asynchronously in real time. We introduce the concept of an event blob as a spatio-temporal likelihood of event occurrence where the conditional spatial likelihood is blob-like. Many real-world objects such as car headlights or any quickly moving foreground objects generate event blob data. The proposed algorithm uses a nearest neighbour classifier with a dynamic threshold criteria for data association coupled with an extended Kalman filter to track the event blob state. Our algorithm achieves highly accurate blob tracking, velocity estimation, and shape estimation even under challenging lighting conditions and high-speed motions (> 11000 pixels/s). The microsecond time resolution achieved means that the filter output can be used to derive secondary information such as time-to-contact or range estimation, that will enable applications to real-world problems such as collision avoidance in autonomous driving.
comment: 18 pages, 16 figures. The manuscript was accepted on August 7, 2024, by IEEE Transactions on Robotics
♻ ☆ From Lab to Field: Real-World Evaluation of an AI-Driven Smart Video Solution to Enhance Community Safety
This article adopts and evaluates an AI-enabled Smart Video Solution (SVS) designed to enhance safety in the real world. The system integrates with existing infrastructure camera networks, leveraging recent advancements in AI for easy adoption. Prioritizing privacy and ethical standards, pose based data is used for downstream AI tasks such as anomaly detection. Cloud-based infrastructure and mobile app are deployed, enabling real-time alerts within communities. The SVS employs innovative data representation and visualization techniques, such as the Occupancy Indicator, Statistical Anomaly Detection, Bird's Eye View, and Heatmaps, to understand pedestrian behaviors and enhance public safety. Evaluation of the SVS demonstrates its capacity to convert complex computer vision outputs into actionable insights for stakeholders, community partners, law enforcement, urban planners, and social scientists. This article presents a comprehensive real-world deployment and evaluation of the SVS, implemented in a community college environment across 16 cameras. The system integrates AI-driven visual processing, supported by statistical analysis, database management, cloud communication, and user notifications. Additionally, the article evaluates the end-to-end latency from the moment an AI algorithm detects anomalous behavior in real-time at the camera level to the time stakeholders receive a notification. The results demonstrate the system's robustness, effectively managing 16 CCTV cameras with a consistent throughput of 16.5 frames per second (FPS) over a 21-hour period and an average end-to-end latency of 26.76 seconds between anomaly detection and alert issuance.
Information Retrieval
☆ Bioinformatics Retrieval Augmentation Data (BRAD) Digital Assistant
We present a prototype for a Bioinformatics Retrieval Augmentation Data (BRAD) digital assistant. BRAD integrates a suite of tools to handle a wide range of bioinformatics tasks, from code execution to online search. We demonstrate BRAD's capabilities through (1) improved question-and-answering with retrieval augmented generation (RAG), (2) BRAD's ability to run and write complex software pipelines, and (3) BRAD's ability to organize and distribute tasks across individual and teams of agents. We use BRAD for automation of bioinformatics workflows, performing tasks ranging from gene enrichment and searching the archive to automatic code generation and running biomarker identification pipelines. BRAD is a step toward the ultimate goal to develop a digital twin of laboratories driven by self-contained loops for hypothesis generation and testing of digital biology experiments.
☆ Building a Scalable, Effective, and Steerable Search and Ranking Platform
Modern e-commerce platforms offer vast product selections, making it difficult for customers to find items that they like and that are relevant to their current session intent. This is why it is key for e-commerce platforms to have near real-time scalable and adaptable personalized ranking and search systems. While numerous methods exist in the scientific literature for building such systems, many are unsuitable for large-scale industrial use due to complexity and performance limitations. Consequently, industrial ranking systems often resort to computationally efficient yet simplistic retrieval or candidate generation approaches, which overlook near real-time and heterogeneous customer signals, which results in a less personalized and relevant experience. Moreover, related customer experiences are served by completely different systems, which increases complexity, maintenance, and inconsistent experiences. In this paper, we present a personalized, adaptable near real-time ranking platform that is reusable across various use cases, such as browsing and search, and that is able to cater to millions of items and customers under heavy load (thousands of requests per second). We employ transformer-based models through different ranking layers which can learn complex behavior patterns directly from customer action sequences while being able to incorporate temporal (e.g. in-session) and contextual information. We validate our system through a series of comprehensive offline and online real-world experiments at a large online e-commerce platform, and we demonstrate its superiority when compared to existing systems, both in terms of customer experience as well as in net revenue. Finally, we share the lessons learned from building a comprehensive, modern ranking platform for use in a large-scale e-commerce environment.
☆ Pooling And Attention: What Are Effective Designs For LLm-Based Embedding Models?
The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
comment: https://github.com/yixuantt/PoolingAndAttn
☆ RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models
Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can produce a retriever with reasonable overall performance, models trained on domain-specific data often yield better results within their respective domains. While prior work in information retrieval has tackled this through multi-task training, the topic of combining multiple domain-specific expert retrievers remains unexplored, despite its popularity in language model generation. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts along with a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of experts without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average) commonly used in language modeling. Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. To our knowledge, RouterRetriever is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models with effective routing over a single, general-purpose embedding model in retrieval tasks.
☆ A Fashion Item Recommendation Model in Hyperbolic Space CVPR 2024
In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users' purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
comment: This work was presented at the CVFAD Workshop at CVPR 2024
☆ AlignGroup: Learning and Aligning Group Consensus with Member Preferences for Group Recommendation CIKM 2024
Group activities are important behaviors in human society, providing personalized recommendations for groups is referred to as the group recommendation task. Existing methods can usually be categorized into two strategies to infer group preferences: 1) determining group preferences by aggregating members' personalized preferences, and 2) inferring group consensus by capturing group members' coherent decisions after common compromises. However, the former would suffer from the lack of group-level considerations, and the latter overlooks the fine-grained preferences of individual users. To this end, we propose a novel group recommendation method AlignGroup, which focuses on both group consensus and individual preferences of group members to infer the group decision-making. Specifically, AlignGroup explores group consensus through a well-designed hypergraph neural network that efficiently learns intra- and inter-group relationships. Moreover, AlignGroup innovatively utilizes a self-supervised alignment task to capture fine-grained group decision-making by aligning the group consensus with members' common preferences. Extensive experiments on two real-world datasets validate that our AlignGroup outperforms the state-of-the-art on both the group recommendation task and the user recommendation task, as well as outperforms the efficiency of most baselines.
comment: 10 pages, accepted by CIKM 2024
☆ iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering Nearest Neighbor Search SIGMOD 2025
Range-filtering approximate nearest neighbor (RFANN) search is attracting increasing attention in academia and industry. Given a set of data objects, each being a pair of a high-dimensional vector and a numeric value, an RFANN query with a vector and a numeric range as parameters returns the data object whose numeric value is in the query range and whose vector is nearest to the query vector. To process this query, a recent study proposes to build $O(n^2)$ dedicated graph-based indexes for all possible query ranges to enable efficient processing on a database of $n$ objects. As storing all these indexes is prohibitively expensive, the study constructs compressed indexes instead, which reduces the memory consumption considerably. However, this incurs suboptimal performance because the compression is lossy. In this study, instead of materializing a compressed index for every possible query range in preparation for querying, we materialize graph-based indexes, called elemental graphs, for a moderate number of ranges. We then provide an effective and efficient algorithm that during querying can construct an index for any query range using the elemental graphs. We prove that the time needed to construct such an index is low. We also cover an experimental study on real-world datasets that provides evidence that the materialized elemental graphs only consume moderate space and that the proposed method is capable of superior and stable query performance across different query workloads.
comment: The paper has been accepted by SIGMOD 2025
☆ An Effective Tag Assignment Approach for Billboard Advertisement
Billboard Advertisement has gained popularity due to its significant outrage in return on investment. To make this advertisement approach more effective, the relevant information about the product needs to be reached to the relevant set of people. This can be achieved if the relevant set of tags can be mapped to the correct slots. Formally, we call this problem the Tag Assignment Problem in Billboard Advertisement. Given trajectory, billboard database, and a set of selected billboard slots and tags, this problem asks to output a mapping of selected tags to the selected slots so that the influence is maximized. We model this as a variant of traditional bipartite matching called One-To-Many Bipartite Matching (OMBM). Unlike traditional bipartite matching, a tag can be assigned to only one slot; in the OMBM, a tag can be assigned to multiple slots while the vice versa can not happen. We propose an iterative solution approach that incrementally allocates the tags to the slots. The proposed methodology has been explained with an illustrated example. A complexity analysis of the proposed solution approach has also been conducted. The experimental results on real-world trajectory and billboard datasets prove our claim on the effectiveness and efficiency of the proposed solution.
comment: This Paper has been accepted at The 25th International Web Information Systems Engineering Conference (WISE-2024)
☆ Deep Adaptive Interest Network: Personalized Recommendation with Context-Aware Learning
In personalized recommendation systems, accurately capturing users' evolving interests and combining them with contextual information is a critical research area. This paper proposes a novel model called the Deep Adaptive Interest Network (DAIN), which dynamically models users' interests while incorporating context-aware learning mechanisms to achieve precise and adaptive personalized recommendations. DAIN leverages deep learning techniques to build an adaptive interest network structure that can capture users' interest changes in real-time while further optimizing recommendation results by integrating contextual information. Experiments conducted on several public datasets demonstrate that DAIN excels in both recommendation performance and computational efficiency. This research not only provides a new solution for personalized recommendation systems but also offers fresh insights into the application of context-aware learning in recommendation systems.
☆ NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
♻ ☆ The Design of an LLM-powered Unstructured Analytics System
LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.
comment: 6 pages, 3 figures, fixed typos
♻ ☆ Multimodal Recommender Systems: A Survey
The recommender system (RS) has been an integral toolkit of online services. They are equipped with various deep learning techniques to model user preference based on identifier and attribute information. With the emergence of multimedia services, such as short videos, news and etc., understanding these contents while recommending becomes critical. Besides, multimodal features are also helpful in alleviating the problem of data sparsity in RS. Thus, Multimodal Recommender System (MRS) has attracted much attention from both academia and industry recently. In this paper, we will give a comprehensive survey of the MRS models, mainly from technical views. First, we conclude the general procedures and major challenges for MRS. Then, we introduce the existing MRS models according to four categories, i.e., Modality Encoder, Feature Interaction, Feature Enhancement and Model Optimization. Besides, to make it convenient for those who want to research this field, we also summarize the dataset and code resources. Finally, we discuss some promising future directions of MRS and conclude this paper. To access more details of the surveyed papers, such as implementation code, we open source a repository.
comment: accepted by CSUR
♻ ☆ MARS: Matching Attribute-aware Representations for Text-based Sequential Recommendation CIKM 2024
Sequential recommendation aims to predict the next item a user is likely to prefer based on their sequential interaction history. Recently, text-based sequential recommendation has emerged as a promising paradigm that uses pre-trained language models to exploit textual item features to enhance performance and facilitate knowledge transfer to unseen datasets. However, existing text-based recommender models still struggle with two key challenges: (i) representing users and items with multiple attributes, and (ii) matching items with complex user interests. To address these challenges, we propose a novel model, Matching Attribute-aware Representations for Text-based Sequential Recommendation (MARS). MARS extracts detailed user and item representations through attribute-aware text encoding, capturing diverse user intents with multiple attribute-aware representations. It then computes user-item scores via attribute-wise interaction matching, effectively capturing attribute-level user preferences. Our extensive experiments demonstrate that MARS significantly outperforms existing sequential models, achieving improvements of up to 24.43% and 29.26% in Recall@10 and NDCG@10 across five benchmark datasets. Code is available at https://github.com/junieberry/MARS
comment: CIKM 2024
♻ ☆ HIRO: Hierarchical Information Retrieval Optimization
Retrieval-Augmented Generation (RAG) has revolutionized natural language processing by dynamically integrating external knowledge into Large Language Models (LLMs), addressing their limitation of static training datasets. Recent implementations of RAG leverage hierarchical data structures, which organize documents at various levels of summarization and information density. This complexity, however, can cause LLMs to "choke" on information overload, necessitating more sophisticated querying mechanisms. In this context, we introduce Hierarchical Information Retrieval Optimization (HIRO), a novel querying approach that employs a Depth-First Search (DFS)-based recursive similarity score calculation and branch pruning. This method uniquely minimizes the context delivered to the LLM without informational loss, effectively managing the challenge of excessive data. HIRO's refined approach is validated by a 10.85% improvement in performance on the NarrativeQA dataset.
♻ ☆ Large Language Models for Information Retrieval: A Survey
As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions, such as search agents, within this expanding field.
comment: updated to version 3
♻ ☆ Smart E-commerce Recommendations with Semantic AI
In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.
comment: 8 pages
♻ ☆ NFARec: A Negative Feedback-Aware Recommender Model SIGIR 2024
Graph neural network (GNN)-based models have been extensively studied for recommendations, as they can extract high-order collaborative signals accurately which is required for high-quality recommender systems. However, they neglect the valuable information gained through negative feedback in two aspects: (1) different users might hold opposite feedback on the same item, which hampers optimal information propagation in GNNs, and (2) even when an item vastly deviates from users' preferences, they might still choose it and provide a negative rating. In this paper, we propose a negative feedback-aware recommender model (NFARec) that maximizes the leverage of negative feedback. To transfer information to multi-hop neighbors along an optimal path effectively, NFARec adopts a feedback-aware correlation that guides hypergraph convolutions (HGCs) to learn users' structural representations. Moreover, NFARec incorporates an auxiliary task - predicting the feedback sentiment polarity (i.e., positive or negative) of the next interaction - based on the Transformer Hawkes Process. The task is beneficial for understanding users by learning the sentiment expressed in their previous sequential feedback patterns and predicting future interactions. Extensive experiments demonstrate that NFARec outperforms competitive baselines. Our source code and data are released at https://github.com/WangXFng/NFARec.
comment: Accepted to SIGIR 2024
♻ ☆ CaDRec: Contextualized and Debiased Recommender Model SIGIR 2024
Recommender models aimed at mining users' behavioral patterns have raised great attention as one of the essential applications in daily life. Recent work on graph neural networks (GNNs) or debiasing methods has attained remarkable gains. However, they still suffer from (1) over-smoothing node embeddings caused by recursive convolutions with GNNs, and (2) the skewed distribution of interactions due to popularity and user-individual biases. This paper proposes a contextualized and debiased recommender model (CaDRec). To overcome the over-smoothing issue, we explore a novel hypergraph convolution operator that can select effective neighbors during convolution by introducing both structural context and sequential context. To tackle the skewed distribution, we propose two strategies for disentangling interactions: (1) modeling individual biases to learn unbiased item embeddings, and (2) incorporating item popularity with positional encoding. Moreover, we mathematically show that the imbalance of the gradients to update item embeddings exacerbates the popularity bias, thus adopting regularization and weighting schemes as solutions. Extensive experiments on four datasets demonstrate the superiority of the CaDRec against state-of-the-art (SOTA) methods. Our source code and data are released at https://github.com/WangXFng/CaDRec.
comment: Accepted to SIGIR 2024
♻ ☆ Evaluating Named Entity Recognition Using Few-Shot Prompting with Large Language Models
This paper evaluates Few-Shot Prompting with Large Language Models for Named Entity Recognition (NER). Traditional NER systems rely on extensive labeled datasets, which are costly and time-consuming to obtain. Few-Shot Prompting or in-context learning enables models to recognize entities with minimal examples. We assess state-of-the-art models like GPT-4 in NER tasks, comparing their few-shot performance to fully supervised benchmarks. Results show that while there is a performance gap, large models excel in adapting to new entity types and domains with very limited data. We also explore the effects of prompt engineering, guided output format and context length on performance. This study underscores Few-Shot Learning's potential to reduce the need for large labeled datasets, enhancing NER scalability and accessibility.
comment: Github repo: https://github.com/GEODE-project/ner-llm
♻ ☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever EMNLP
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks,
comment: 8 pages, references at pp7,8; EMNLP workshop submission
Machine Learning
☆ Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling
Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs' original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20$\times$ speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, even with the 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs' generation results in the previous literature.
comment: 40 pages
☆ Topological Methods in Machine Learning: A Tutorial for Practitioners
Topological Machine Learning (TML) is an emerging field that leverages techniques from algebraic topology to analyze complex data structures in ways that traditional machine learning methods may not capture. This tutorial provides a comprehensive introduction to two key TML techniques, persistent homology and the Mapper algorithm, with an emphasis on practical applications. Persistent homology captures multi-scale topological features such as clusters, loops, and voids, while the Mapper algorithm creates an interpretable graph summarizing high-dimensional data. To enhance accessibility, we adopt a data-centric approach, enabling readers to gain hands-on experience applying these techniques to relevant tasks. We provide step-by-step explanations, implementations, hands-on examples, and case studies to demonstrate how these tools can be applied to real-world problems. The goal is to equip researchers and practitioners with the knowledge and resources to incorporate TML into their work, revealing insights often hidden from conventional machine learning methods. The tutorial code is available at https://github.com/cakcora/TopologyForML
comment: 54 pages, 35 figures
☆ Regional data-driven weather modeling with a global stretched-grid
A data-driven model (DDM) suitable for regional weather forecasting applications is presented. The model extends the Artificial Intelligence Forecasting System by introducing a stretched-grid architecture that dedicates higher resolution over a regional area of interest and maintains a lower resolution elsewhere on the globe. The model is based on graph neural networks, which naturally affords arbitrary multi-resolution grid configurations. The model is applied to short-range weather prediction for the Nordics, producing forecasts at 2.5 km spatial and 6 h temporal resolution. The model is pre-trained on 43 years of global ERA5 data at 31 km resolution and is further refined using 3.3 years of 2.5 km resolution operational analyses from the MetCoOp Ensemble Prediction System (MEPS). The performance of the model is evaluated using surface observations from measurement stations across Norway and is compared to short-range weather forecasts from MEPS. The DDM outperforms both the control run and the ensemble mean of MEPS for 2 m temperature. The model also produces competitive precipitation and wind speed forecasts, but is shown to underestimate extreme events.
☆ Benchmarking Spurious Bias in Few-Shot Image Classifiers ECCV 2024
Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in certain samples and few-shot classifiers can suffer from spurious bias induced from them. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at https://github.com/gtzheng/FewSTAB.
comment: Accepted to ECCV 2024
☆ Configurable Foundation Models: Building LLMs from a Modular Perspective
Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.
☆ Hybrid Imitation-Learning Motion Planner for Urban Driving
With the release of open source datasets such as nuPlan and Argoverse, the research around learning-based planners has spread a lot in the last years. Existing systems have shown excellent capabilities in imitating the human driver behaviour, but they struggle to guarantee safe closed-loop driving. Conversely, optimization-based planners offer greater security in short-term planning scenarios. To confront this challenge, in this paper we propose a novel hybrid motion planner that integrates both learning-based and optimization-based techniques. Initially, a multilayer perceptron (MLP) generates a human-like trajectory, which is then refined by an optimization-based component. This component not only minimizes tracking errors but also computes a trajectory that is both kinematically feasible and collision-free with obstacles and road boundaries. Our model effectively balances safety and human-likeness, mitigating the trade-off inherent in these objectives. We validate our approach through simulation experiments and further demonstrate its efficacy by deploying it in real-world self-driving vehicles.
☆ Look Into the LITE in Deep Learning for Time Series Classification
Deep learning models have been shown to be a powerful solution for Time Series Classification (TSC). State-of-the-art architectures, while producing promising results on the UCR and the UEA archives , present a high number of trainable parameters. This can lead to long training with high CO2 emission, power consumption and possible increase in the number of FLoating-point Operation Per Second (FLOPS). In this paper, we present a new architecture for TSC, the Light Inception with boosTing tEchnique (LITE) with only 2.34% of the number of parameters of the state-of-the-art InceptionTime model, while preserving performance. This architecture, with only 9, 814 trainable parameters due to the usage of DepthWise Separable Convolutions (DWSC), is boosted by three techniques: multiplexing, custom filters, and dilated convolution. The LITE architecture, trained on the UCR, is 2.78 times faster than InceptionTime and consumes 2.79 times less CO2 and power. To evaluate the performance of the proposed architecture on multivariate time series data, we adapt LITE to handle multivariate time series, we call this version LITEMV. To bring theory into application, we also conducted experiments using LITEMV on multivariate time series representing human rehabilitation movements, showing that LITEMV not only is the most efficient model but also the best performing for this application on the Kimore dataset, a skeleton based human rehabilitation exercises dataset. Moreover, to address the interpretability of LITEMV, we present a study using Class Activation Maps to understand the classification decision taken by the model during evaluation.
☆ Building a Scalable, Effective, and Steerable Search and Ranking Platform
Modern e-commerce platforms offer vast product selections, making it difficult for customers to find items that they like and that are relevant to their current session intent. This is why it is key for e-commerce platforms to have near real-time scalable and adaptable personalized ranking and search systems. While numerous methods exist in the scientific literature for building such systems, many are unsuitable for large-scale industrial use due to complexity and performance limitations. Consequently, industrial ranking systems often resort to computationally efficient yet simplistic retrieval or candidate generation approaches, which overlook near real-time and heterogeneous customer signals, which results in a less personalized and relevant experience. Moreover, related customer experiences are served by completely different systems, which increases complexity, maintenance, and inconsistent experiences. In this paper, we present a personalized, adaptable near real-time ranking platform that is reusable across various use cases, such as browsing and search, and that is able to cater to millions of items and customers under heavy load (thousands of requests per second). We employ transformer-based models through different ranking layers which can learn complex behavior patterns directly from customer action sequences while being able to incorporate temporal (e.g. in-session) and contextual information. We validate our system through a series of comprehensive offline and online real-world experiments at a large online e-commerce platform, and we demonstrate its superiority when compared to existing systems, both in terms of customer experience as well as in net revenue. Finally, we share the lessons learned from building a comprehensive, modern ranking platform for use in a large-scale e-commerce environment.
☆ Oops, I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning
The predominant method for computing confidence intervals (CI) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e.\ allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://github.com/RafLaf/FSL-benchmark-again
☆ SNNAX -- Spiking Neural Networks in JAX
Spiking Neural Networks (SNNs) simulators are essential tools to prototype biologically inspired models and neuromorphic hardware architectures and predict their performance. For such a tool, ease of use and flexibility are critical, but so is simulation speed especially given the complexity inherent to simulating SNN. Here, we present SNNAX, a JAX-based framework for simulating and training such models with PyTorch-like intuitiveness and JAX-like execution speed. SNNAX models are easily extended and customized to fit the desired model specifications and target neuromorphic hardware. Additionally, SNNAX offers key features for optimizing the training and deployment of SNNs such as flexible automatic differentiation and just-in-time compilation. We evaluate and compare SNNAX to other commonly used machine learning (ML) frameworks used for programming SNNs. We provide key performance metrics, best practices, documented examples for simulating SNNs in SNNAX, and implement several benchmarks used in the literature.
☆ Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models
This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named "Prediction statements," categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.
☆ Obsidian: Cooperative State-Space Exploration for Performant Inference on Secure ML Accelerators
Trusted execution environments (TEEs) for machine learning accelerators are indispensable in secure and efficient ML inference. Optimizing workloads through state-space exploration for the accelerator architectures improves performance and energy consumption. However, such explorations are expensive and slow due to the large search space. Current research has to use fast analytical models that forego critical hardware details and cross-layer opportunities unique to the hardware security primitives. While cycle-accurate models can theoretically reach better designs, their high runtime cost restricts them to a smaller state space. We present Obsidian, an optimization framework for finding the optimal mapping from ML kernels to a secure ML accelerator. Obsidian addresses the above challenge by exploring the state space using analytical and cycle-accurate models cooperatively. The two main exploration components include: (1) A secure accelerator analytical model, that includes the effect of secure hardware while traversing the large mapping state space and produce the best m model mappings; (2) A compiler profiling step on a cycle-accurate model, that captures runtime bottlenecks to further improve execution runtime, energy and resource utilization and find the optimal model mapping. We compare our results to a baseline secure accelerator, comprising of the state-of-the-art security schemes obtained from guardnn [ 33 ] and sesame [11]. The analytical model reduces the inference latency by 20.5% for a cloud and 8.4% for an edge deployment with an energy improvement of 24% and 19% respectively. The cycle-accurate model, further reduces the latency by 9.1% for a cloud and 12.2% for an edge with an energy improvement of 13.8% and 13.1%.
☆ Boosting Certificate Robustness for Time Series Classification with Efficient Self-Ensemble
Recently, the issue of adversarial robustness in the time series domain has garnered significant attention. However, the available defense mechanisms remain limited, with adversarial training being the predominant approach, though it does not provide theoretical guarantees. Randomized Smoothing has emerged as a standout method due to its ability to certify a provable lower bound on robustness radius under $\ell_p$-ball attacks. Recognizing its success, research in the time series domain has started focusing on these aspects. However, existing research predominantly focuses on time series forecasting, or under the non-$\ell_p$ robustness in statistic feature augmentation for time series classification~(TSC). Our review found that Randomized Smoothing performs modestly in TSC, struggling to provide effective assurances on datasets with poor robustness. Therefore, we propose a self-ensemble method to enhance the lower bound of the probability confidence of predicted labels by reducing the variance of classification margins, thereby certifying a larger radius. This approach also addresses the computational overhead issue of Deep Ensemble~(DE) while remaining competitive and, in some cases, outperforming it in terms of robustness. Both theoretical analysis and experimental results validate the effectiveness of our method, demonstrating superior performance in robustness testing compared to baseline approaches.
comment: 6 figures, 4 tables, 10 pages
☆ UnLearning from Experience to Avoid Spurious Correlations
While deep neural networks can achieve state-of-the-art performance in many tasks, these models are more fragile than they appear. They are prone to learning spurious correlations in their training data, leading to surprising failure cases. In this paper, we propose a new approach that addresses the issue of spurious correlations: UnLearning from Experience (ULE). Our method is based on using two classification models trained in parallel: student and teacher models. Both models receive the same batches of training data. The student model is trained with no constraints and pursues the spurious correlations in the data. The teacher model is trained to solve the same classification problem while avoiding the mistakes of the student model. As training is done in parallel, the better the student model learns the spurious correlations, the more robust the teacher model becomes. The teacher model uses the gradient of the student's output with respect to its input to unlearn mistakes made by the student. We show that our method is effective on the Waterbirds, CelebA, Spawrious and UrbanCars datasets.
comment: 10 pages
☆ Regularized Multi-output Gaussian Convolution Process with Domain Adaptation
Multi-output Gaussian process (MGP) has been attracting increasing attention as a transfer learning method to model multiple outputs. Despite its high flexibility and generality, MGP still faces two critical challenges when applied to transfer learning. The first one is negative transfer, which occurs when there exists no shared information among the outputs. The second challenge is the input domain inconsistency, which is commonly studied in transfer learning yet not explored in MGP. In this paper, we propose a regularized MGP modeling framework with domain adaptation to overcome these challenges. More specifically, a sparse covariance matrix of MGP is proposed by using convolution process, where penalization terms are added to adaptively select the most informative outputs for knowledge transfer. To deal with the domain inconsistency, a domain adaptation method is proposed by marginalizing inconsistent features and expanding missing features to align the input domains among different outputs. Statistical properties of the proposed method are provided to guarantee the performance practically and asymptotically. The proposed framework outperforms state-of-the-art benchmarks in comprehensive simulation studies and one real case study of a ceramic manufacturing process. The results demonstrate the effectiveness of our method in dealing with both the negative transfer and the domain inconsistency.
☆ Unifying Causal Representation Learning with the Invariance Principle
Causal representation learning aims at recovering latent causal variables from high-dimensional observations to solve causal downstream tasks, such as predicting the effect of new interventions or more robust classification. A plethora of methods have been developed, each tackling carefully crafted problem settings that lead to different types of identifiability. The folklore is that these different settings are important, as they are often linked to different rungs of Pearl's causal hierarchy, although not all neatly fit. Our main contribution is to show that many existing causal representation learning approaches methodologically align the representation to known data symmetries. Identification of the variables is guided by equivalence classes across different data pockets that are not necessarily causal. This result suggests important implications, allowing us to unify many existing approaches in a single method that can mix and match different assumptions, including non-causal ones, based on the invariances relevant to our application. It also significantly benefits applicability, which we demonstrate by improving treatment effect estimation on real-world high-dimensional ecological data. Overall, this paper clarifies the role of causality assumptions in the discovery of causal variables and shifts the focus to preserving data symmetries.
comment: 36 pages
☆ Tractable Offline Learning of Regular Decision Processes
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs). In RDPs, the unknown dependency of future observations and rewards from the past interactions can be captured by some hidden finite-state automaton. For this reason, many RDP algorithms first reconstruct this unknown dependency using automata learning techniques. In this paper, we show that it is possible to overcome two strong limitations of previous offline RL algorithms for RDPs, notably RegORL. This can be accomplished via the introduction of two original techniques: the development of a new pseudometric based on formal languages, which removes a problematic dependency on $L_\infty^\mathsf{p}$-distinguishability parameters, and the adoption of Count-Min-Sketch (CMS), instead of naive counting. The former reduces the number of samples required in environments that are characterized by a low complexity in language-theoretic terms. The latter alleviates the memory requirements for long planning horizons. We derive the PAC sample complexity bounds associated to each of these techniques, and we validate the approach experimentally.
comment: To appear in EWRL 2024
☆ Convolutional Neural Networks for Automated Cellular Automaton Classification
The emergent dynamics in spacetime diagrams of cellular automata (CAs) is often organised by means of a number of behavioural classes. Whilst classification of elementary CAs is feasible and well-studied, non-elementary CAs are generally too diverse and numerous to exhaustively classify manually. In this chapter we treat the spacetime diagram as a digital image, and implement simple computer vision techniques to perform an automated classification of elementary cellular automata into the five Li-Packard classes. In particular, we present a supervised learning task to a convolutional neural network, in such a way that it may be generalised to non-elementary CAs. If we want to do so, we must divert the algorithm's focus away from the underlying 'microscopic' local updates. We first show that previously developed deep learning approaches have in fact been trained to identify the local update rule, rather than directly focus on the mesoscopic patterns that are associated with the particular behavioural classes. By means of a well-argued neural network design, as well as a number of data augmentation techniques, we then present a convolutional neural network that performs nearly perfectly at identifying the behavioural class, without necessarily first identifying the underlying microscopic dynamics.
comment: 19 pages, 12 figures, book chapter
☆ Complete and Efficient Covariants for 3D Point Configurations with Application to Learning Molecular Quantum Properties
When modeling physical properties of molecules with machine learning, it is desirable to incorporate $SO(3)$-covariance. While such models based on low body order features are not complete, we formulate and prove general completeness properties for higher order methods, and show that $6k-5$ of these features are enough for up to $k$ atoms. We also find that the Clebsch--Gordan operations commonly used in these methods can be replaced by matrix multiplications without sacrificing completeness, lowering the scaling from $O(l^6)$ to $O(l^3)$ in the degree of the features. We apply this to quantum chemistry, but the proposed methods are generally applicable for problems involving 3D point configurations.
☆ Task-Oriented Communication for Graph Data: A Graph Information Bottleneck Approach
Graph data, essential in fields like knowledge representation and social networks, often involves large networks with many nodes and edges. Transmitting these graphs can be highly inefficient due to their size and redundancy for specific tasks. This paper introduces a method to extract a smaller, task-focused subgraph that maintains key information while reducing communication overhead. Our approach utilizes graph neural networks (GNNs) and the graph information bottleneck (GIB) principle to create a compact, informative, and robust graph representation suitable for transmission. The challenge lies in the irregular structure of graph data, making GIB optimization complex. We address this by deriving a tractable variational upper bound for the objective function. Additionally, we propose the VQ-GIB mechanism, integrating vector quantization (VQ) to convert subgraph representations into a discrete codebook sequence, compatible with existing digital communication systems. Our experiments show that this GIB-based method significantly lowers communication costs while preserving essential task-related information. The approach demonstrates robust performance across various communication channels, suitable for both continuous and discrete systems.
☆ A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations
Machine translation in low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy, impeding the performance of machine translation models. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Our methodology leverages a multilingual SBERT model to filter out problematic translations in the training data. Specifically, we employ an IndicSBERT similarity model to assess the semantic equivalence between original and translated sentences, allowing us to retain linguistically correct translations while discarding instances with substantial deviations. The results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT. This illustrates how cross-lingual sentence representations can reduce errors in machine translation scenarios with limited resources. By integrating multilingual sentence BERT models into the translation pipeline, this research contributes to advancing machine translation techniques in low-resource environments. The proposed method not only addresses the challenges in English-Marathi language pairs but also provides a valuable framework for enhancing translation quality in other low-resource language translation tasks.
comment: Accepted at I2CT 2024
☆ Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit
Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects.
☆ Decision Transformer for Enhancing Neural Local Search on the Job Shop Scheduling Problem
The job shop scheduling problem (JSSP) and its solution algorithms have been of enduring interest in both academia and industry for decades. In recent years, machine learning (ML) is playing an increasingly important role in advancing existing and building new heuristic solutions for the JSSP, aiming to find better solutions in shorter computation times. In this paper we build on top of a state-of-the-art deep reinforcement learning (DRL) agent, called Neural Local Search (NLS), which can efficiently and effectively control a large local neighborhood search on the JSSP. In particular, we develop a method for training the decision transformer (DT) algorithm on search trajectories taken by a trained NLS agent to further improve upon the learned decision-making sequences. Our experiments show that the DT successfully learns local search strategies that are different and, in many cases, more effective than those of the NLS agent itself. In terms of the tradeoff between solution quality and acceptable computational time needed for the search, the DT is particularly superior in application scenarios where longer computational times are acceptable. In this case, it makes up for the longer inference times required per search step, which are caused by the larger neural network architecture, through better quality decisions per step. Thereby, the DT achieves state-of-the-art results for solving the JSSP with ML-enhanced search.
comment: currently under review for IEEE Transactions on Cybernetics
☆ Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.
☆ Neural timescales from a computational perspective
Timescales of neural activity are diverse across and within brain areas, and experimental observations suggest that neural timescales reflect information in dynamic environments. However, these observations do not specify how neural timescales are shaped, nor whether particular timescales are necessary for neural computations and brain function. Here, we take a complementary perspective and synthesize three directions where computational methods can distill the broad set of empirical observations into quantitative and testable theories: We review (i) how data analysis methods allow us to capture different timescales of neural dynamics across different recording modalities, (ii) how computational models provide a mechanistic explanation for the emergence of diverse timescales, and (iii) how task-optimized models in machine learning uncover the functional relevance of neural timescales. This integrative computational approach, combined with empirical findings, would provide a more holistic understanding of how neural timescales capture the relationship between brain structure, dynamics, and behavior.
comment: 18 pages, 4 figures, 2 boxes
☆ Neural Networks with LSTM and GRU in Modeling Active Fires in the Amazon
This study presents a comprehensive methodology for modeling and forecasting the historical time series of fire spots detected by the AQUA_M-T satellite in the Amazon, Brazil. The approach utilizes a mixed Recurrent Neural Network (RNN) model, combining Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures to predict monthly accumulations of daily detected fire spots. A summary of the data revealed a consistent seasonality over time, with annual maximum and minimum fire spot values tending to repeat at the same periods each year. The primary objective is to verify whether the forecasts capture this inherent seasonality through rigorous statistical analysis. The methodology involved careful data preparation, model configuration, and training using cross-validation with two seeds, ensuring that the data generalizes well to the test and validation sets, and confirming the convergence of the model parameters. The results indicate that the mixed LSTM and GRU model offers improved accuracy in forecasting 12 months ahead, demonstrating its effectiveness in capturing complex temporal patterns and modeling the observed time series. This research significantly contributes to the application of deep learning techniques in environmental monitoring, specifically in fire spot forecasting. In addition to improving forecast accuracy, the proposed approach highlights the potential for adaptation to other time series forecasting challenges, opening new avenues for research and development in machine learning and natural phenomenon prediction. Keywords: Time Series Forecasting, Recurrent Neural Networks, Deep Learning.
comment: 16 pages, in Portuguese language, 24 figures
☆ Independence Constrained Disentangled Representation Learning from Epistemological Perspective
Disentangled Representation Learning aims to improve the explainability of deep learning methods by training a data encoder that identifies semantically meaningful latent variables in the data generation process. Nevertheless, there is no consensus regarding a universally accepted definition for the objective of disentangled representation learning. In particular, there is a considerable amount of discourse regarding whether should the latent variables be mutually independent or not. In this paper, we first investigate these arguments on the interrelationships between latent variables by establishing a conceptual bridge between Epistemology and Disentangled Representation Learning. Then, inspired by these interdisciplinary concepts, we introduce a two-level latent space framework to provide a general solution to the prior arguments on this issue. Finally, we propose a novel method for disentangled representation learning by employing an integration of mutual information constraint and independence constraint within the Generative Adversarial Network (GAN) framework. Experimental results demonstrate that our proposed method consistently outperforms baseline approaches in both quantitative and qualitative evaluations. The method exhibits strong performance across multiple commonly used metrics and demonstrates a great capability in disentangling various semantic factors, leading to an improved quality of controllable generation, which consequently benefits the explainability of the algorithm.
☆ Causality-Aware Transformer Networks for Robotic Navigation
Recent advances in machine learning algorithms have garnered growing interest in developing versatile Embodied AI systems. However, current research in this domain reveals opportunities for improvement. First, the direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling, potentially limiting its performance in Embodied AI tasks. Second, the reliance on task-specific configurations, such as pre-trained modules and dataset-specific logic, compromises the generalizability of these methods. We address these constraints by initially exploring the unique differences between Embodied AI tasks and other sequential data tasks through the lens of Causality, presenting a causal framework to elucidate the inadequacies of conventional sequential methods for Embodied AI. By leveraging this causal perspective, we propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module to enhance the models's Environmental Understanding capability. Meanwhile, our method is devoid of task-specific inductive biases and can be trained in an End-to-End manner, which enhances the method's generalizability across various contexts. Empirical evaluations demonstrate that our methodology consistently surpasses benchmark performances across a spectrum of settings, tasks and simulation environments. Extensive ablation studies reveal that the performance gains can be attributed to the Causal Understanding Module, which demonstrates effectiveness and efficiency in both Reinforcement Learning and Supervised Learning settings.
☆ Introduction to Machine Learning
This book introduces the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms that are used in machine learning. It starts with an introductory chapter that describes notation used throughout the book and serve at a reminder of basic concepts in calculus, linear algebra and probability and also introduces some measure theoretic terminology, which can be used as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapter provides theoretical support to many algorithms that are used in the book, including stochastic gradient descent, proximal methods, etc. After discussing basic concepts for statistical prediction, the book includes an introduction to reproducing kernel theory and Hilbert space techniques, which are used in many places, before addressing the description of various algorithms for supervised statistical learning, including linear methods, support vector machines, decision trees, boosting, or neural networks. The subject then switches to generative methods, starting with a chapter that presents sampling methods and an introduction to the theory of Markov chains. The following chapter describe the theory of graphical models, an introduction to variational methods for models with latent variables, and to deep-learning based generative models. The next chapters focus on unsupervised learning methods, for clustering, factor analysis and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.
comment: textbook
☆ Learning-Based Error Detection System for Advanced Vehicle Instrument Cluster Rendering
The automotive industry is currently expanding digital display options with every new model that comes onto the market. This entails not just an expansion in dimensions, resolution, and customization choices, but also the capability to employ novel display effects like overlays while assembling the content of the display cluster. Unfortunately, this raises the need for appropriate monitoring systems that can detect rendering errors and apply appropriate countermeasures when required. Classical solutions such as Cyclic Redundancy Checks (CRC) will soon be no longer viable as any sort of alpha blending, warping of scaling of content can cause unwanted CRC violations. Therefore, we propose a novel monitoring approach to verify correctness of displayed content using telltales (e.g. warning signs) as example. It uses a learning-based approach to separate "good" telltales, i.e. those that a human driver will understand correctly, and "corrupted" telltales, i.e. those that will not be visible or perceived correctly. As a result, it possesses inherent resilience against individual pixel errors and implicitly supports changing backgrounds, overlay or scaling effects. This is underlined by our experimental study where all "corrupted" test patterns were correctly classified, while no false alarms were triggered.
comment: 9 pages
☆ Conformal Prediction in Dynamic Biological Systems
Uncertainty quantification (UQ) is the process of systematically determining and characterizing the degree of confidence in computational model predictions. In the context of systems biology, especially with dynamic models, UQ is crucial because it addresses the challenges posed by nonlinearity and parameter sensitivity, allowing us to properly understand and extrapolate the behavior of complex biological systems. Here, we focus on dynamic models represented by deterministic nonlinear ordinary differential equations. Many current UQ approaches in this field rely on Bayesian statistical methods. While powerful, these methods often require strong prior specifications and make parametric assumptions that may not always hold in biological systems. Additionally, these methods face challenges in domains where sample sizes are limited, and statistical inference becomes constrained, with computational speed being a bottleneck in large models of biological systems. As an alternative, we propose the use of conformal inference methods, introducing two novel algorithms that, in some instances, offer non-asymptotic guarantees, enhancing robustness and scalability across various applications. We demonstrate the efficacy of our proposed algorithms through several scenarios, highlighting their advantages over traditional Bayesian approaches. The proposed methods show promising results for diverse biological data structures and scenarios, offering a general framework to quantify uncertainty for dynamic models of biological systems.The software for the methodology and the reproduction of the results is available at https://zenodo.org/doi/10.5281/zenodo.13644870.
☆ AdvSecureNet: A Python Toolkit for Adversarial Machine Learning
Machine learning models are vulnerable to adversarial attacks. Several tools have been developed to research these vulnerabilities, but they often lack comprehensive features and flexibility. We introduce AdvSecureNet, a PyTorch based toolkit for adversarial machine learning that is the first to natively support multi-GPU setups for attacks, defenses, and evaluation. It is the first toolkit that supports both CLI and API interfaces and external YAML configuration files to enhance versatility and reproducibility. The toolkit includes multiple attacks, defenses and evaluation metrics. Rigiorous software engineering practices are followed to ensure high code quality and maintainability. The project is available as an open-source project on GitHub at https://github.com/melihcatal/advsecurenet and installable via PyPI.
☆ (Implicit) Ensembles of Ensembles: Epistemic Uncertainty Collapse in Large Models
Epistemic uncertainty is crucial for safety-critical applications and out-of-distribution detection tasks. Yet, we uncover a paradoxical phenomenon in deep learning models: an epistemic uncertainty collapse as model complexity increases, challenging the assumption that larger models invariably offer better uncertainty quantification. We propose that this stems from implicit ensembling within large models. To support this hypothesis, we demonstrate epistemic uncertainty collapse empirically across various architectures, from explicit ensembles of ensembles and simple MLPs to state-of-the-art vision models, including ResNets and Vision Transformers -- for the latter, we examine implicit ensemble extraction and decompose larger models into diverse sub-models, recovering epistemic uncertainty. We provide theoretical justification for these phenomena and explore their implications for uncertainty estimation.
comment: 10 pages
☆ Hypothesizing Missing Causal Variables with LLMs
Scientific discovery is a catalyst for human intellectual advances, driven by the cycle of hypothesis generation, experimental design, data evaluation, and iterative assumption refinement. This process, while crucial, is expensive and heavily dependent on the domain knowledge of scientists to generate hypotheses and navigate the scientific cycle. Central to this is causality, the ability to establish the relationship between the cause and the effect. Motivated by the scientific discovery process, in this work, we formulate a novel task where the input is a partial causal graph with missing variables, and the output is a hypothesis about the missing variables to complete the partial graph. We design a benchmark with varying difficulty levels and knowledge assumptions about the causal graph. With the growing interest in using Large Language Models (LLMs) to assist in scientific discovery, we benchmark open-source and closed models on our testbed. We show the strong ability of LLMs to hypothesize the mediation variables between a cause and its effect. In contrast, they underperform in hypothesizing the cause and effect variables themselves. We also observe surprising results where some of the open-source models outperform the closed GPT-4 model.
comment: Code - https://github.com/ivaxi0s/hypothesizing-causal-variable-llm
☆ A Fashion Item Recommendation Model in Hyperbolic Space CVPR 2024
In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users' purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
comment: This work was presented at the CVFAD Workshop at CVPR 2024
☆ An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.
comment: Accepted in the IEEE Soken Language Technology Workshop 2024
☆ Multiview Random Vector Functional Link Network for Predicting DNA-Binding Proteins
The identification of DNA-binding proteins (DBPs) is a critical task due to their significant impact on various biological activities. Understanding the mechanisms underlying protein-DNA interactions is essential for elucidating various life activities. In recent years, machine learning-based models have been prominently utilized for DBP prediction. In this paper, to predict DBPs, we propose a novel framework termed a multiview random vector functional link (MvRVFL) network, which fuses neural network architecture with multiview learning. The proposed MvRVFL model combines the benefits of late and early fusion, allowing for distinct regularization parameters across different views while leveraging a closed-form solution to determine unknown parameters efficiently. The primal objective function incorporates a coupling term aimed at minimizing a composite of errors stemming from all views. From each of the three protein views of the DBP datasets, we extract five features. These features are then fused together by incorporating a hidden feature during the model training process. The performance of the proposed MvRVFL model on the DBP dataset surpasses that of baseline models, demonstrating its superior effectiveness. Furthermore, we extend our assessment to the UCI, KEEL, AwA, and Corel5k datasets, to establish the practicality of the proposed models. The consistency error bound, the generalization error bound, and empirical findings, coupled with rigorous statistical analyses, confirm the superior generalization capabilities of the MvRVFL model compared to the baseline models.
☆ BMI Prediction from Handwritten English Characters Using a Convolutional Neural Network
A person's Body Mass Index, or BMI, is the most widely used parameter for assessing their health. BMI is a crucial predictor of potential diseases that may arise at higher body fat levels because it is correlated with body fat. Conversely, a community's or an individual's nutritional status can be determined using the BMI. Although deep learning models are used in several studies to estimate BMI from face photos and other data, no previous research established a clear connection between deep learning techniques for handwriting analysis and BMI prediction. This article addresses this research gap with a deep learning approach to estimating BMI from handwritten characters by developing a convolutional neural network (CNN). A dataset containing samples from 48 people in lowercase English scripts is successfully captured for the BMI prediction task. The proposed CNN-based approach reports a commendable accuracy of 99.92%. Performance comparison with other popular CNN architectures reveals that AlexNet and InceptionV3 achieve the second and third-best performance, with the accuracy of 99.69% and 99.53%, respectively.
☆ Advancing Cyber Incident Timeline Analysis Through Rule Based AI and Large Language Models
Timeline Analysis (TA) is a key part of Timeline Forensics (TF) in Digital Forensics (DF), focusing primarily on examining and analysing temporal digital artefacts such as timestamps, derived from event logs, file metadata, and other related data to correlate events resulting from cyber incidents and reconstruct their chronological timeline. Traditional tools often struggle to efficiently process the vast volume and variety of data acquired during DF investigations and Incident Response (IR) processes. This paper presents a novel framework, GenDFIR, that combines Rule-Based Artificial Intelligence (R-BAI) algorithms with Large Language Models (LLMs) to advance and automate the TA process. Our approach consists of two main stages (1) We use R-BAI to identify and select anomalous digital artefacts based on predefined rules. (2) The selected artefacts are then converted into embeddings for processing by an LLM with the help of a Retrieval-Augmented Generation (RAG) agent. The LLM consequently leverages its capabilities to perform automated TA on the artefacts and predict potential incident scenarios. To validate our framework, we evaluate GenDFIR performance, efficiency, and reliability using various metrics across synthetic cyber incident simulation scenarios. This paper presents a proof of concept, where the findings demonstrate the significant potential of integrating R-BAI and LLMs for TA. This novel approach highlights the power of Generative AI (GenAI), specifically LLMs, and opens new avenues for advanced threat detection and incident reconstruction, representing a significant step forward in the field.
comment: 25 pages
☆ Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation
Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
comment: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
☆ Understanding eGFR Trajectories and Kidney Function Decline via Large Multimodal Models
The estimated Glomerular Filtration Rate (eGFR) is an essential indicator of kidney function in clinical practice. Although traditional equations and Machine Learning (ML) models using clinical and laboratory data can estimate eGFR, accurately predicting future eGFR levels remains a significant challenge for nephrologists and ML researchers. Recent advances demonstrate that Large Language Models (LLMs) and Large Multimodal Models (LMMs) can serve as robust foundation models for diverse applications. This study investigates the potential of LMMs to predict future eGFR levels with a dataset consisting of laboratory and clinical values from 50 patients. By integrating various prompting techniques and ensembles of LMMs, our findings suggest that these models, when combined with precise prompts and visual representations of eGFR trajectories, offer predictive performance comparable to existing ML models. This research extends the application of foundation models and suggests avenues for future studies to harness these models in addressing complex medical forecasting challenges.
comment: This preprint version includes corrections of typographical errors related to numerical values in Table 2, which were present in the version published at the BDH workshop in MIPR 2024. These corrections do not affect the overall conclusions of the study
☆ Sample what you cant compress
For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled interpretation. Concurrently, in generative settings diffusion has demonstrated a remarkable ability to create crisp, high quality results and has solid theoretical underpinnings (from variational inference to direct study as the Fisher Divergence). Our work combines autoencoder representation learning with diffusion and is, to our knowledge, the first to demonstrate the efficacy of jointly learning a continuous encoder and decoder under a diffusion-based loss. We demonstrate that this approach yields better reconstruction quality as compared to GAN-based autoencoders while being easier to tune. We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss. Since our decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation; we therefore name our approach "Sample what you can't compress", or SWYCC for short.
☆ Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems
While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
comment: 4 pages, 4 figures, for demo samples, see https://sytronik.github.io/demos/voc_smth_aug/
☆ Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal
Artificial neural networks, especially recent diffusion-based models, have shown remarkable superiority in gaming, control, and QA systems, where the training tasks' datasets are usually static. However, in real-world applications, such as robotic control of reinforcement learning (RL), the tasks are changing, and new tasks arise in a sequential order. This situation poses the new challenge of plasticity-stability trade-off for training an agent who can adapt to task changes and retain acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains 90 tasks from multiple domains. Then, we train the CoD on each task with sequential modeling and conditional generation for making decisions. Next, we preserve a small portion of previous datasets as the rehearsal buffer and replay it to retain the acquired knowledge. Extensive experiments on a series of tasks show CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based methods and other representative baselines on most tasks.
☆ CoAst: Validation-Free Contribution Assessment for Federated Learning based on Cross-Round Valuation
In the federated learning (FL) process, since the data held by each participant is different, it is necessary to figure out which participant has a higher contribution to the model performance. Effective contribution assessment can help motivate data owners to participate in the FL training. Research works in this field can be divided into two directions based on whether a validation dataset is required. Validation-based methods need to use representative validation data to measure the model accuracy, which is difficult to obtain in practical FL scenarios. Existing validation-free methods assess the contribution based on the parameters and gradients of local models and the global model in a single training round, which is easily compromised by the stochasticity of model training. In this work, we propose CoAst, a practical method to assess the FL participants' contribution without access to any validation data. The core idea of CoAst involves two aspects: one is to only count the most important part of model parameters through a weights quantization, and the other is a cross-round valuation based on the similarity between the current local parameters and the global parameter updates in several subsequent communication rounds. Extensive experiments show that CoAst has comparable assessment reliability to existing validation-based methods and outperforms existing validation-free methods.
☆ Reliable Deep Diffusion Tensor Estimation: Rethinking the Power of Data-Driven Optimization Routine
Diffusion tensor imaging (DTI) holds significant importance in clinical diagnosis and neuroscience research. However, conventional model-based fitting methods often suffer from sensitivity to noise, leading to decreased accuracy in estimating DTI parameters. While traditional data-driven deep learning methods have shown potential in terms of accuracy and efficiency, their limited generalization to out-of-training-distribution data impedes their broader application due to the diverse scan protocols used across centers, scanners, and studies. This work aims to tackle these challenges and promote the use of DTI by introducing a data-driven optimization-based method termed DoDTI. DoDTI combines the weighted linear least squares fitting algorithm and regularization by denoising technique. The former fits DW images from diverse acquisition settings into diffusion tensor field, while the latter applies a deep learning-based denoiser to regularize the diffusion tensor field instead of the DW images, which is free from the limitation of fixed-channel assignment of the network. The optimization object is solved using the alternating direction method of multipliers and then unrolled to construct a deep neural network, leveraging a data-driven strategy to learn network parameters. Extensive validation experiments are conducted utilizing both internally simulated datasets and externally obtained in-vivo datasets. The results, encompassing both qualitative and quantitative analyses, showcase that the proposed method attains state-of-the-art performance in DTI parameter estimation. Notably, it demonstrates superior generalization, accuracy, and efficiency, rendering it highly reliable for widespread application in the field.
☆ Adversarial Attacks on Machine Learning-Aided Visualizations
Research in ML4VIS investigates how to use machine learning (ML) techniques to generate visualizations, and the field is rapidly growing with high societal impact. However, as with any computational pipeline that employs ML processes, ML4VIS approaches are susceptible to a range of ML-specific adversarial attacks. These attacks can manipulate visualization generations, causing analysts to be tricked and their judgments to be impaired. Due to a lack of synthesis from both visualization and ML perspectives, this security aspect is largely overlooked by the current ML4VIS literature. To bridge this gap, we investigate the potential vulnerabilities of ML-aided visualizations from adversarial attacks using a holistic lens of both visualization and ML perspectives. We first identify the attack surface (i.e., attack entry points) that is unique in ML-aided visualizations. We then exemplify five different adversarial attacks. These examples highlight the range of possible attacks when considering the attack surface and multiple different adversary capabilities. Our results show that adversaries can induce various attacks, such as creating arbitrary and deceptive visualizations, by systematically identifying input attributes that are influential in ML inferences. Based on our observations of the attack surface characteristics and the attack examples, we underline the importance of comprehensive studies of security issues and defense mechanisms as a call of urgency for the ML4VIS community.
comment: This is the author's version of the article that has been accepted by the Journal of Visualization
☆ Volumetric Surfaces: Representing Fuzzy Geometries with Multiple Meshes
High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). These problems are exacerbated on low-performance graphics hardware, e.g. on mobile devices. We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed layer order from outermost to innermost. We model mesh layers as SDF shells with optimal spacing learned during training. After baking, we fit UV textures to the corresponding meshes. We show that our method can represent challenging fuzzy objects while achieving higher frame rates than volume-based and splatting-based methods on low-end and mobile devices.
☆ Demographic parity in regression and classification within the unawareness framework
This paper explores the theoretical foundations of fair regression under the constraint of demographic parity within the unawareness framework, where disparate treatment is prohibited, extending existing results where such treatment is permitted. Specifically, we aim to characterize the optimal fair regression function when minimizing the quadratic loss. Our results reveal that this function is given by the solution to a barycenter problem with optimal transport costs. Additionally, we study the connection between optimal fair cost-sensitive classification, and optimal fair regression. We demonstrate that nestedness of the decision sets of the classifiers is both necessary and sufficient to establish a form of equivalence between classification and regression. Under this nestedness assumption, the optimal classifiers can be derived by applying thresholds to the optimal fair regression function; conversely, the optimal fair regression function is characterized by the family of cost-sensitive classifiers.
☆ ForeCal: Random Forest-based Calibration for DNNs
Deep neural network(DNN) based classifiers do extremely well in discriminating between observations, resulting in higher ROC AUC and accuracy metrics, but their outputs are often miscalibrated with respect to true event likelihoods. Post-hoc calibration algorithms are often used to calibrate the outputs of these classifiers. Methods like Isotonic regression, Platt scaling, and Temperature scaling have been shown to be effective in some cases but are limited by their parametric assumptions and/or their inability to capture complex non-linear relationships. We propose ForeCal - a novel post-hoc calibration algorithm based on Random forests. ForeCal exploits two unique properties of Random forests: the ability to enforce weak monotonicity and range-preservation. It is more powerful in achieving calibration than current state-of-the-art methods, is non-parametric, and can incorporate exogenous information as features to learn a better calibration function. Through experiments on 43 diverse datasets from the UCI ML repository, we show that ForeCal outperforms existing methods in terms of Expected Calibration Error(ECE) with minimal impact on the discriminative power of the base DNN as measured by AUC.
☆ Adversarial Learning for Neural PDE Solvers with Sparse Data
Neural network solvers for partial differential equations (PDEs) have made significant progress, yet they continue to face challenges related to data scarcity and model robustness. Traditional data augmentation methods, which leverage symmetry or invariance, impose strong assumptions on physical systems that often do not hold in dynamic and complex real-world applications. To address this research gap, this study introduces a universal learning strategy for neural network PDEs, named Systematic Model Augmentation for Robust Training (SMART). By focusing on challenging and improving the model's weaknesses, SMART reduces generalization error during training under data-scarce conditions, leading to significant improvements in prediction accuracy across various PDE scenarios. The effectiveness of the proposed method is demonstrated through both theoretical analysis and extensive experimentation. The code will be available.
☆ Transfer-based Adversarial Poisoning Attacks for Online (MIMO-)Deep Receviers
Recently, the design of wireless receivers using deep neural networks (DNNs), known as deep receivers, has attracted extensive attention for ensuring reliable communication in complex channel environments. To adapt quickly to dynamic channels, online learning has been adopted to update the weights of deep receivers with over-the-air data (e.g., pilots). However, the fragility of neural models and the openness of wireless channels expose these systems to malicious attacks. To this end, understanding these attack methods is essential for robust receiver design.In this paper, we propose a transfer-based adversarial poisoning attack method for online receivers.Without knowledge of the attack target, adversarial perturbations are injected to the pilots, poisoning the online deep receiver and impairing its ability to adapt to dynamic channels and nonlinear effects. In particular, our attack method targets Deep Soft Interference Cancellation (DeepSIC)[1] using online meta-learning.As a classical model-driven deep receiver, DeepSIC incorporates wireless domain knowledge into its architecture. This integration allows it to adapt efficiently to time-varying channels with only a small number of pilots, achieving optimal performance in a multi-input and multi-output (MIMO) scenario.The deep receiver in this scenario has a number of applications in the field of wireless communication, which motivates our study of the attack methods targeting it.Specifically, we demonstrate the effectiveness of our attack in simulations on synthetic linear, synthetic nonlinear, static, and COST 2100 channels. Simulation results indicate that the proposed poisoning attack significantly reduces the performance of online receivers in rapidly changing scenarios.
comment: 15 pages, 14 figures
☆ Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to be effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ the reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully correct the reward code with only one feedback for each requirement, effectively preventing irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables the acquisition of different reward functions within the Pareto solution set without weight search. Even in the case where a weight is 100 times off, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts utilizing GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.
☆ Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering
Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.
comment: 39 pages, 9 figures
☆ Deep Adaptive Interest Network: Personalized Recommendation with Context-Aware Learning
In personalized recommendation systems, accurately capturing users' evolving interests and combining them with contextual information is a critical research area. This paper proposes a novel model called the Deep Adaptive Interest Network (DAIN), which dynamically models users' interests while incorporating context-aware learning mechanisms to achieve precise and adaptive personalized recommendations. DAIN leverages deep learning techniques to build an adaptive interest network structure that can capture users' interest changes in real-time while further optimizing recommendation results by integrating contextual information. Experiments conducted on several public datasets demonstrate that DAIN excels in both recommendation performance and computational efficiency. This research not only provides a new solution for personalized recommendation systems but also offers fresh insights into the application of context-aware learning in recommendation systems.
☆ Relative-Translation Invariant Wasserstein Distance
We introduce a new family of distances, relative-translation invariant Wasserstein distances ($RW_p$), for measuring the similarity of two probability distributions under distribution shift. Generalizing it from the classical optimal transport model, we show that $RW_p$ distances are also real distance metrics defined on the quotient set $\mathcal{P}_p(\mathbb{R}^n)/\sim$ and invariant to distribution translations. When $p=2$, the $RW_2$ distance enjoys more exciting properties, including decomposability of the optimal transport model, translation-invariance of the $RW_2$ distance, and a Pythagorean relationship between $RW_2$ and the classical quadratic Wasserstein distance ($W_2$). Based on these properties, we show that a distribution shift, measured by $W_2$ distance, can be explained in the bias-variance perspective. In addition, we propose a variant of the Sinkhorn algorithm, named $RW_2$ Sinkhorn algorithm, for efficiently calculating $RW_2$ distance, coupling solutions, as well as $W_2$ distance. We also provide the analysis of numerical stability and time complexity for the proposed algorithm. Finally, we validate the $RW_2$ distance metric and the algorithm performance with three experiments. We conduct one numerical validation for the $RW_2$ Sinkhorn algorithm and show two real-world applications demonstrating the effectiveness of using $RW_2$ under distribution shift: digits recognition and similar thunderstorm detection. The experimental results report that our proposed algorithm significantly improves the computational efficiency of Sinkhorn in certain practical applications, and the $RW_2$ distance is robust to distribution translations compared with baselines.
☆ Abstractive Text Summarization: State of the Art, Challenges, and Improvements
Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations and charts out future improvements - providing researchers an extensive overview to advance abstractive summarization research. We provide vital comparison tables across techniques categorized - offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.
comment: 9 Tables, 7 Figures
☆ Adaptive Class Emergence Training: Enhancing Neural Network Stability and Generalization through Progressive Target Evolution
Recent advancements in artificial intelligence, particularly deep neural networks, have pushed the boundaries of what is achievable in complex tasks. Traditional methods for training neural networks in classification problems often rely on static target outputs, such as one-hot encoded vectors, which can lead to unstable optimization and difficulties in handling non-linearities within data. In this paper, we propose a novel training methodology that progressively evolves the target outputs from a null vector to one-hot encoded vectors throughout the training process. This gradual transition allows the network to adapt more smoothly to the increasing complexity of the classification task, maintaining an equilibrium state that reduces the risk of overfitting and enhances generalization. Our approach, inspired by concepts from structural equilibrium in finite element analysis, has been validated through extensive experiments on both synthetic and real-world datasets. The results demonstrate that our method achieves faster convergence, improved accuracy, and better generalization, especially in scenarios with high data complexity and noise. This progressive training framework offers a robust alternative to classical methods, opening new perspectives for more efficient and stable neural network training.
comment: 15 pages, 9 figures, 2 tables
☆ Learning Privacy-Preserving Student Networks via Discriminative-Generative Distillation
While deep models have proved successful in learning rich knowledge from massive well-annotated data, they may pose a privacy leakage risk in practical deployment. It is necessary to find an effective trade-off between high utility and strong privacy. In this work, we propose a discriminative-generative distillation approach to learn privacy-preserving deep models. Our key idea is taking models as bridge to distill knowledge from private data and then transfer it to learn a student network via two streams. First, discriminative stream trains a baseline classifier on private data and an ensemble of teachers on multiple disjoint private subsets, respectively. Then, generative stream takes the classifier as a fixed discriminator and trains a generator in a data-free manner. After that, the generator is used to generate massive synthetic data which are further applied to train a variational autoencoder (VAE). Among these synthetic data, a few of them are fed into the teacher ensemble to query labels via differentially private aggregation, while most of them are embedded to the trained VAE for reconstructing synthetic data. Finally, a semi-supervised student learning is performed to simultaneously handle two tasks: knowledge transfer from the teachers with distillation on few privately labeled synthetic data, and knowledge enhancement with tangent-normal adversarial regularization on many triples of reconstructed synthetic data. In this way, our approach can control query cost over private data and mitigate accuracy degradation in a unified manner, leading to a privacy-preserving student model. Extensive experiments and analysis clearly show the effectiveness of the proposed approach.
comment: This paper is accepted by IEEE Transactions on Image Processing (TIP)
☆ Building Math Agents with Multi-Turn Iterative Preference Learning
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms are originally designed for the single-turn chat task, and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill in this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated through training of various language models using an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model's performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.
comment: A multi-turn direct preference learning framework for tool-integrated reasoning tasks
☆ Gaussian Rate-Distortion-Perception Coding and Entropy-Constrained Scalar Quantization
This paper investigates the best known bounds on the quadratic Gaussian distortion-rate-perception function with limited common randomness for the Kullback-Leibler divergence-based perception measure, as well as their counterparts for the squared Wasserstein-2 distance-based perception measure, recently established by Xie et al. These bounds are shown to be nondegenerate in the sense that they cannot be deduced from each other via a refined version of Talagrand's transportation inequality. On the other hand, an improved lower bound is established when the perception measure is given by the squared Wasserstein-2 distance. In addition, it is revealed by exploiting the connection between rate-distortion-perception coding and entropy-constrained scalar quantization that all the aforementioned bounds are generally not tight in the weak perception constraint regime.
☆ Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing
Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from intriguing observations: among a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness in the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identified editing directions with nice properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The codes will be released at https://github.com/ChicyChen/LOCO-Edit.
☆ Optimal Neural Network Approximation for High-Dimensional Continuous Functions
Recently, the authors of Shen Yang Zhang (JMLR, 2022) developed a neural network with width $36d(2d + 1)$ and depth $11$, which utilizes a special activation function called the elementary universal activation function, to achieve the super approximation property for functions in $C([a,b]^d)$. That is, the constructed network only requires a fixed number of neurons to approximate a $d$-variate continuous function on a $d$-dimensional hypercube with arbitrary accuracy. Their network uses $\mathcal{O}(d^2)$ fixed neurons. One natural question to address is whether we can reduce the number of these neurons in such a network. By leveraging a variant of the Kolmogorov Superposition Theorem, our analysis shows that there is a neural network generated by the elementary universal activation function with only $366d +365$ fixed, intrinsic (non-repeated) neurons that attains this super approximation property. Furthermore, we present a family of continuous functions that requires at least width $d$, and therefore at least $d$ intrinsic neurons, to achieve arbitrary accuracy in its approximation. This shows that the requirement of $\mathcal{O}(d)$ intrinsic neurons is optimal in the sense that it grows linearly with the input dimension $d$, unlike some approximation methods where parameters may grow exponentially with $d$.
☆ Machine Learning Applications to Computational Plasma Physics and Reduced-Order Plasma Modeling: A Perspective
Machine learning (ML) provides a broad spectrum of tools and architectures that enable the transformation of data from simulations and experiments into useful and explainable science, thereby augmenting domain knowledge. Furthermore, ML-enhanced numerical modelling can revamp scientific computing for real-world complex engineering systems, creating unique opportunities to examine the operation of the technologies in detail and automate their optimization and control. In recent years, ML applications have seen significant growth across various scientific domains, particularly in fluid mechanics, where ML has shown great promise in enhancing computational modeling of fluid flows. In contrast, ML applications in numerical plasma physics research remain relatively limited in scope and extent. Despite this, the close relationship between fluid mechanics and plasma physics presents a valuable opportunity to create a roadmap for transferring ML advances in fluid flow modeling to computational plasma physics. This Perspective aims to outline such a roadmap. We begin by discussing some general fundamental aspects of ML, including the various categories of ML algorithms and the different types of problems that can be solved with the help of ML. With regard to each problem type, we then present specific examples from the use of ML in computational fluid dynamics, reviewing several insightful prior efforts. We also review recent ML applications in plasma physics for each problem type. The paper discusses promising future directions and development pathways for ML in plasma modelling within the different application areas. Additionally, we point out prominent challenges that must be addressed to realize ML's full potential in computational plasma physics, including the need for cost-effective high-fidelity simulation tools for extensive data generation.
comment: 42 pages, 20 figures
☆ Understanding the Role of Functional Diversity in Weight-Ensembling with Ingredient Selection and Multidimensional Scaling ICML 2024
Weight-ensembles are formed when the parameters of multiple neural networks are directly averaged into a single model. They have demonstrated generalization capability in-distribution (ID) and out-of-distribution (OOD) which is not completely understood, though they are thought to successfully exploit functional diversity allotted by each distinct model. Given a collection of models, it is also unclear which combination leads to the optimal weight-ensemble; the SOTA is a linear-time ``greedy" method. We introduce two novel weight-ensembling approaches to study the link between performance dynamics and the nature of how each method decides to use apply the functionally diverse components, akin to diversity-encouragement in the prediction-ensemble literature. We develop a visualization tool to explain how each algorithm explores various domains defined via pairwise-distances to further investigate selection and algorithms' convergence. Empirical analyses shed perspectives which reinforce how high-diversity enhances weight-ensembling while qualifying the extent to which diversity alone improves accuracy. We also demonstrate that sampling positionally distinct models can contribute just as meaningfully to improvements in a weight-ensemble.
comment: Published at the ICML 2024 (Vienna, Austria) Workshop on Foundation Models in the Wild
☆ Robust Federated Finetuning of Foundation Models via Alternating Minimization of LoRA ICML2024
Parameter-Efficient Fine-Tuning (PEFT) has risen as an innovative training strategy that updates only a select few model parameters, significantly lowering both computational and memory demands. PEFT also helps to decrease data transfer in federated learning settings, where communication depends on the size of updates. In this work, we explore the constraints of previous studies that integrate a well-known PEFT method named LoRA with federated fine-tuning, then introduce RoLoRA, a robust federated fine-tuning framework that utilizes an alternating minimization approach for LoRA, providing greater robustness against decreasing fine-tuning parameters and increasing data heterogeneity. Our results indicate that RoLoRA not only presents the communication benefits but also substantially enhances the robustness and effectiveness in multiple federated fine-tuning scenarios.
comment: Presented at ES-FOMO-II@ICML2024
☆ NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
$k$-Nearest Neighbor search on dense vector embeddings ($k$-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
☆ Optimal sampling for least-squares approximation
Least-squares approximation is one of the most important methods for recovering an unknown function from data. While in many applications the data is fixed, in many others there is substantial freedom to choose where to sample. In this paper, we review recent progress on optimal sampling for (weighted) least-squares approximation in arbitrary linear spaces. We introduce the Christoffel function as a key quantity in the analysis of (weighted) least-squares approximation from random samples, then show how it can be used to construct sampling strategies that possess near-optimal sample complexity: namely, the number of samples scales log-linearly in $n$, the dimension of the approximation space. We discuss a series of variations, extensions and further topics, and throughout highlight connections to approximation theory, machine learning, information-based complexity and numerical linear algebra. Finally, motivated by various contemporary applications, we consider a generalization of the classical setting where the samples need not be pointwise samples of a scalar-valued function, and the approximation space need not be linear. We show that even in this significantly more general setting suitable generalizations of the Christoffel function still determine the sample complexity. This provides a unified procedure for designing improved sampling strategies for general recovery problems. This article is largely self-contained, and intended to be accessible to nonspecialists.
☆ Data-driven 2D stationary quantum droplets and wave propagations in the amended GP equation with two potentials via deep neural networks learning
In this paper, we develop a systematic deep learning approach to solve two-dimensional (2D) stationary quantum droplets (QDs) and investigate their wave propagation in the 2D amended Gross-Pitaevskii equation with Lee-Huang-Yang correction and two kinds of potentials. Firstly, we use the initial-value iterative neural network (IINN) algorithm for 2D stationary quantum droplets of stationary equations. Then the learned stationary QDs are used as the initial value conditions for physics-informed neural networks (PINNs) to explore their evolutions in the some space-time region. Especially, we consider two types of potentials, one is the 2D quadruple-well Gaussian potential and the other is the PT-symmetric HO-Gaussian potential, which lead to spontaneous symmetry breaking and the generation of multi-component QDs. The used deep learning method can also be applied to study wave propagations of other nonlinear physical models.
comment: 17 pages, 12 figures (Proc. R. Soc. A, accepted for publication). arXiv admin note: text overlap with arXiv:2409.01124
♻ ☆ Enhancing Graph Neural Networks with Limited Labeled Data by Actively Distilling Knowledge from Large Language Models
Graphs are pervasive in the real-world, such as social network analysis, bioinformatics, and knowledge graphs. Graph neural networks (GNNs) have great ability in node classification, a fundamental task on graphs. Unfortunately, conventional GNNs still face challenges in scenarios with few labeled nodes, despite the prevalence of few-shot node classification tasks in real-world applications. To address this challenge, various approaches have been proposed, including graph meta-learning, transfer learning, and methods based on Large Language Models (LLMs). However, traditional meta-learning and transfer learning methods often require prior knowledge from base classes or fail to exploit the potential advantages of unlabeled nodes. Meanwhile, LLM-based methods may overlook the zero-shot capabilities of LLMs and rely heavily on the quality of generated contexts. In this paper, we propose a novel approach that integrates LLMs and GNNs, leveraging the zero-shot inference and reasoning capabilities of LLMs and employing a Graph-LLM-based active learning paradigm to enhance GNNs' performance. Extensive experiments demonstrate the effectiveness of our model in improving node classification accuracy with considerably limited labeled data, surpassing state-of-the-art baselines by significant margins.
comment: 10 pages, 3 Figures
♻ ☆ Decentralized Intelligence Network (DIN)
Decentralized Intelligence Network (DIN) is a theoretical framework designed to address challenges in AI development, particularly focusing on data fragmentation and siloing issues. It facilitates effective AI training within sovereign data networks by overcoming barriers to accessing diverse data sources, leveraging: 1) personal data stores to ensure data sovereignty, where data remains securely within Participants' control; 2) a scalable federated learning protocol implemented on a public blockchain for decentralized AI training, where only model parameter updates are shared, keeping data within the personal data stores; and 3) a scalable, trustless cryptographic rewards mechanism on a public blockchain to incentivize participation and ensure fair reward distribution through a decentralized auditing protocol. This approach guarantees that no entity can prevent or control access to training data or influence financial benefits, as coordination and reward distribution are managed on the public blockchain with an immutable record. The framework supports effective AI training by allowing Participants to maintain control over their data, benefit financially, and contribute to a decentralized, scalable ecosystem that leverages collective AI to develop beneficial algorithms.
comment: 16 pages, 1 figure. DIN was presented by the author as a speaker at the Summit on Responsible Decentralized Intelligence - Future of Decentralization and AI, hosted by Berkeley RDI on August 6, 2024, at the Verizon Center, Cornell Tech Campus, Roosevelt Island, NYC
♻ ☆ Kolmogorov n-Widths for Multitask Physics-Informed Machine Learning (PIML) Methods: Towards Robust Metrics
Physics-informed machine learning (PIML) as a means of solving partial differential equations (PDE) has garnered much attention in the Computational Science and Engineering (CS&E) world. This topic encompasses a broad array of methods and models aimed at solving a single or a collection of PDE problems, called multitask learning. PIML is characterized by the incorporation of physical laws into the training process of machine learning models in lieu of large data when solving PDE problems. Despite the overall success of this collection of methods, it remains incredibly difficult to analyze, benchmark, and generally compare one approach to another. Using Kolmogorov n-widths as a measure of effectiveness of approximating functions, we judiciously apply this metric in the comparison of various multitask PIML architectures. We compute lower accuracy bounds and analyze the model's learned basis functions on various PDE problems. This is the first objective metric for comparing multitask PIML architectures and helps remove uncertainty in model validation from selective sampling and overfitting. We also identify avenues of improvement for model architectures, such as the choice of activation function, which can drastically affect model generalization to "worst-case" scenarios, which is not observed when reporting task-specific errors. We also incorporate this metric into the optimization process through regularization, which improves the models' generalizability over the multitask PDE problem.
♻ ☆ Hybrid Decentralized Optimization: Leveraging Both First- and Zeroth-Order Optimizers for Faster Convergence
Distributed optimization is the standard way of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods. Yet, there are settings where some computationally-bounded nodes may not be able to implement first-order, gradient-based optimization, while they could still contribute to joint optimization tasks. In this paper, we initiate the study of hybrid decentralized optimization, studying settings where nodes with zeroth-order and first-order optimization capabilities co-exist in a distributed system, and attempt to jointly solve an optimization task over some data distribution. We essentially show that, under reasonable parameter settings, such a system can not only withstand noisier zeroth-order agents but can even benefit from integrating such agents into the optimization process, rather than ignoring their information. At the core of our approach is a new analysis of distributed optimization with noisy and possibly-biased gradient estimators, which may be of independent interest. Our results hold for both convex and non-convex objectives. Experimental results on standard optimization tasks confirm our analysis, showing that hybrid first-zeroth order optimization can be practical, even when training deep neural networks.
comment: Shayan Talaei and Matin Ansaripour contributed equally to this work
♻ ☆ The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem
Large language models (LLMs) are useful tools with the capacity for performing specific types of knowledge work at an effective scale. However, LLM deployments in high-risk and safety-critical domains pose unique challenges, notably the issue of ``hallucination,'' where LLMs can generate fabricated information. This is particularly concerning in settings such as drug safety, where inaccuracies could lead to patient harm. To mitigate these risks, we have developed and demonstrated a proof of concept suite of guardrails specifically designed to mitigate certain types of hallucinations and errors for drug safety, and potentially applicable to other medical safety-critical contexts. These guardrails include mechanisms to detect anomalous documents to prevent the ingestion of inappropriate data, identify incorrect drug names or adverse event terms, and convey uncertainty in generated content. We integrated these guardrails with an LLM fine-tuned for a text-to-text task, which involves converting both structured and unstructured data within adverse event reports into natural language. This method was applied to translate individual case safety reports, demonstrating effective application in a pharmacovigilance processing task. Our guardrail framework offers a set of tools with broad applicability across various domains, ensuring LLMs can be safely used in high-risk situations by eliminating the occurrence of key errors, including the generation of incorrect pharmacovigilance-related terms, thus adhering to stringent regulatory and quality standards in medical safety-critical environments.
comment: 27 pages, 6 figures, 4 tables and supplementary material provided
♻ ☆ GenoCraft: A Comprehensive, User-Friendly Web-Based Platform for High-Throughput Omics Data Analysis and Visualization
The surge in high-throughput omics data has reshaped the landscape of biological research, underlining the need for powerful, user-friendly data analysis and interpretation tools. This paper presents GenoCraft, a web-based comprehensive software solution designed to handle the entire pipeline of omics data processing. GenoCraft offers a unified platform featuring advanced bioinformatics tools, covering all aspects of omics data analysis. It encompasses a range of functionalities, such as normalization, quality control, differential analysis, network analysis, pathway analysis, and diverse visualization techniques. This software makes state-of-the-art omics data analysis more accessible to a wider range of users. With GenoCraft, researchers and data scientists have access to an array of cutting-edge bioinformatics tools under a user-friendly interface, making it a valuable resource for managing and analyzing large-scale omics data. The API with an interactive web interface is publicly available at https://genocraft.stanford. edu/. We also release all the codes in https://github.com/futianfan/GenoCraft.
♻ ☆ $μ$GUIDE: a framework for quantitative imaging via generalized uncertainty-driven inference using deep learning
This work proposes $\mu$GUIDE: a general Bayesian framework to estimate posterior distributions of tissue microstructure parameters from any given biophysical model or MRI signal representation, with exemplar demonstration in diffusion-weighted MRI. Harnessing a new deep learning architecture for automatic signal feature selection combined with simulation-based inference and efficient sampling of the posterior distributions, $\mu$GUIDE bypasses the high computational and time cost of conventional Bayesian approaches and does not rely on acquisition constraints to define model-specific summary statistics. The obtained posterior distributions allow to highlight degeneracies present in the model definition and quantify the uncertainty and ambiguity of the estimated parameters.
♻ ☆ Partially Observable Multi-Agent Reinforcement Learning with Information Sharing ICML 2023
We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential \emph{information-sharing} among agents, a common practice in empirical multi-agent RL, and a standard model for multi-agent control systems with communications. We first establish several computational complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-efficient single-agent RL with partial observations, for efficiently solving POSGs. {Inspired by the inefficiency of planning in the ground-truth model,} we then propose to further \emph{approximate} the shared common information to construct an {approximate model} of the POSG, in which planning an approximate \emph{equilibrium} (in terms of solving the original POSG) can be quasi-efficient, i.e., of quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm that is \emph{both} statistically and computationally quasi-efficient. {Finally, beyond equilibrium learning, we extend our algorithmic framework to finding the \emph{team-optimal solution} in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, a much more challenging goal. We establish concrete computational and sample complexities under several common structural assumptions of the model.} We hope our study could open up the possibilities of leveraging and even designing different \emph{information structures}, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL.
comment: Journal extension of the conference version at ICML 2023. Changed to the more general reward function form, added new results for learning in Dec-POMDPs, and streamlined proof outlines
♻ ☆ Domain Decomposition-based coupling of Operator Inference reduced order models via the Schwarz alternating method
This paper presents and evaluates an approach for coupling together subdomain-local reduced order models (ROMs) constructed via non-intrusive operator inference (OpInf) with each other and with subdomain-local full order models (FOMs), following a domain decomposition of the spatial geometry on which a given partial differential equation (PDE) is posed. Joining subdomain-local models is accomplished using the overlapping Schwarz alternating method, a minimally-intrusive multiscale coupling technique that works by transforming a monolithic problem into a sequence of subdomain-local problems, which communicate through transmission boundary conditions imposed on the subdomain interfaces. After formulating the overlapping Schwarz alternating method for OpInf ROMs, termed OpInf-Schwarz, we evaluate the method's accuracy and efficiency on several test cases involving the heat equation in two spatial dimensions. We demonstrate that the method is capable of coupling together arbitrary combinations of OpInf ROMs and FOMs, and that speed-ups over a monolithic FOM are possible when performing OpInf ROM coupling.
♻ ☆ Simple and Scalable Strategies to Continually Pre-train Large Language Models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
♻ ☆ Convolutional L2LFlows: Generating Accurate Showers in Highly Granular Calorimeters Using Convolutional Normalizing Flows
In the quest to build generative surrogate models as computationally efficient alternatives to rule-based simulations, the quality of the generated samples remains a crucial frontier. So far, normalizing flows have been among the models with the best fidelity. However, as the latent space in such models is required to have the same dimensionality as the data space, scaling up normalizing flows to high dimensional datasets is not straightforward. The prior L2LFlows approach successfully used a series of separate normalizing flows and sequence of conditioning steps to circumvent this problem. In this work, we extend L2LFlows to simulate showers with a 9-times larger profile in the lateral direction. To achieve this, we introduce convolutional layers and U-Net-type connections, move from masked autoregressive flows to coupling layers, and demonstrate the successful modelling of showers in the ILD Electromagnetic Calorimeter as well as Dataset 3 from the public CaloChallenge dataset.
♻ ☆ Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques
We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We utilize imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
♻ ☆ Revisiting Character-level Adversarial Attacks for Language Models ICML 2024
Adversarial attacks in Natural Language Processing apply perturbations in the character or token levels. Token-level attacks, gaining prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention as they cannot easily adopt popular gradient-based methods, and are thought to be easy to defend. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR in 4.84% points and the USE similarity in 8% points with respect to the previous art. Our implementation is available in https://github.com/LIONS-EPFL/Charmer.
comment: Accepted in ICML 2024
♻ ☆ Privacy-aware Berrut Approximated Coded Computing for Federated Learning
Federated Learning (FL) is an interesting strategy that enables the collaborative training of an AI model among different data owners without revealing their private datasets. Even so, FL has some privacy vulnerabilities that have been tried to be overcome by applying some techniques like Differential Privacy (DP), Homomorphic Encryption (HE), or Secure Multi-Party Computation (SMPC). However, these techniques have some important drawbacks that might narrow their range of application: problems to work with non-linear functions and to operate large matrix multiplications and high communication and computational costs to manage semi-honest nodes. In this context, we propose a solution to guarantee privacy in FL schemes that simultaneously solves the previously mentioned problems. Our proposal is based on the Berrut Approximated Coded Computing, a technique from the Coded Distributed Computing paradigm, adapted to a Secret Sharing configuration, to provide input privacy to FL in a scalable way. It can be applied for computing non-linear functions and treats the special case of distributed matrix multiplication, a key primitive at the core of many automated learning tasks. Because of these characteristics, it could be applied in a wide range of FL scenarios, since it is independent of the machine learning models or aggregation algorithms used in the FL scheme. We provide analysis of the achieved privacy and complexity of our solution and, due to the extensive numerical results performed, a good trade-off between privacy and precision can be observed.
♻ ☆ A Systematic Bias of Machine Learning Regression Models and Its Correction: an Application to Imaging-based Brain Age Prediction
Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that largely deviate from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased (underestimating actual values), while those for small-valued outcomes are positively biased (overestimating actual values). We refer to this linear central tendency warped bias as the "systematic bias of machine learning regression". In this paper, we first demonstrate that this systematic prediction bias persists across various machine learning regression models, and then delve into its theoretical underpinnings. To address this issue, we propose a general constrained optimization approach designed to correct this bias and develop computationally efficient implementation algorithms. Simulation results indicate that our correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. In comparison to competing machine learning regression models, our method effectively addresses the longstanding issue of "systematic bias of machine learning regression" in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age.
♻ ☆ Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension
In recent years, there has been interest in how geometric properties such as intrinsic dimension (ID) of a neural network's hidden representations change through its layers, and how such properties are predictive of important model behavior such as generalization ability. However, evidence has begun to emerge that such behavior can change significantly depending on the domain of the network's training data, such as natural versus medical images. Here, we further this inquiry by exploring how the ID of a network's learned representations changes through its layers, in essence, characterizing how the network successively refines the information content of input data to be used for predictions. Analyzing eleven natural and medical image datasets across six network architectures, we find that how ID changes through the network differs noticeably between natural and medical image models. Specifically, medical image models peak in representation ID earlier in the network, implying a difference in the image features and their abstractness that are typically used for downstream tasks in these domains. Additionally, we discover a strong correlation of this peak representation ID with the ID of the data in its input space, implying that the intrinsic information content of a model's learned representations is guided by that of the data it was trained on. Overall, our findings emphasize notable discrepancies in network behavior between natural and non-natural imaging domains regarding hidden representation information content, and provide further insights into how a network's learned features are shaped by its training data.
♻ ☆ Energy-Efficient Channel Decoding for Wireless Federated Learning: Convergence Analysis and Adaptive Design
One of the most critical challenges for deploying distributed learning solutions, such as federated learning (FL), in wireless networks is the limited battery capacity of mobile clients. While it is a common belief that the major energy consumption of mobile clients comes from the uplink data transmission, this paper presents a novel finding, namely channel decoding also contributes significantly to the overall energy consumption of mobile clients in FL. Motivated by this new observation, we propose an energy-efficient adaptive channel decoding scheme that leverages the intrinsic robustness of FL to model errors. In particular, the robustness is exploited to reduce the energy consumption of channel decoders at mobile clients by adaptively adjusting the number of decoding iterations. We theoretically prove that wireless FL with communication errors can converge at the same rate as the case with error-free communication provided the bit error rate (BER) is properly constrained. An adaptive channel decoding scheme is then proposed to improve the energy efficiency of wireless FL systems. Experimental results demonstrate that the proposed method maintains the same learning accuracy while reducing the channel decoding energy consumption by ~20% when compared to an existing approach.
comment: This paper has been accepted by the IEEE TWC. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation
Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these tools to improve them further. Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content. In the present work, we have identified a fundamental limitation related to the image generation ability of LLMs, and termed it The NO Syndrome. This negation blindness refers to LLMs inability to correctly comprehend NO related natural language prompts to generate the desired images. Interestingly, all tested LLMs including GPT-4, Gemini, and Copilot were found to be suffering from this syndrome. To demonstrate the generalization of this limitation, we carried out simulation experiments and conducted entropy-based and benchmark statistical analysis tests on various LLMs in multiple languages, including English, Hindi, and French. We conclude that the NO syndrome is a significant flaw in current LLMs that needs to be addressed. A related finding of this study showed a consistent discrepancy between image and textual responses as a result of this NO syndrome. We posit that the introduction of a negation context-aware reinforcement learning based feedback loop between the LLMs textual response and generated image could help ensure the generated text is based on both the LLMs correct contextual understanding of the negation query and the generated visual output.
comment: 15 pages, 7 figures
♻ ☆ Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS 2024
In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions that rely on keywords. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection engines that rely on textual features and keywords are bypassed, an occurrence our observations show happens frequently.
comment: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)
♻ ☆ Fast and interpretable Support Vector Classification based on the truncated ANOVA decomposition
Support Vector Machines (SVMs) are an important tool for performing classification on scattered data, where one usually has to deal with many data points in high-dimensional spaces. We propose solving SVMs in primal form using feature maps based on trigonometric functions or wavelets. In small dimensional settings the Fast Fourier Transform (FFT) and related methods are a powerful tool in order to deal with the considered basis functions. For growing dimensions the classical FFT-based methods become inefficient due to the curse of dimensionality. Therefore, we restrict ourselves to multivariate basis functions, each of which only depends on a small number of dimensions. This is motivated by the well-known sparsity of effects and recent results regarding the reconstruction of functions from scattered data in terms of truncated analysis of variance (ANOVA) decompositions, which makes the resulting model even interpretable in terms of importance of the features as well as their couplings. The usage of small superposition dimensions has the consequence that the computational effort no longer grows exponentially but only polynomially with respect to the dimension. In order to enforce sparsity regarding the basis coefficients, we use the frequently applied $\ell_2$-norm and, in addition, $\ell_1$-norm regularization. The found classifying function, which is the linear combination of basis functions, and its variance can then be analyzed in terms of the classical ANOVA decomposition of functions. Based on numerical examples we show that we are able to recover the signum of a function that perfectly fits our model assumptions. Furthermore, we perform classification on different artificial and real-world data sets. We obtain better results with $\ell_1$-norm regularization, both in terms of accuracy and clarity of interpretability.
♻ ☆ The future of cosmological likelihood-based inference: accelerated high-dimensional parameter estimation and model comparison
We advocate for a new paradigm of cosmological likelihood-based inference, leveraging recent developments in machine learning and its underlying technology, to accelerate Bayesian inference in high-dimensional settings. Specifically, we combine (i) emulation, where a machine learning model is trained to mimic cosmological observables, e.g. CosmoPower-JAX; (ii) differentiable and probabilistic programming, e.g. JAX and NumPyro, respectively; (iii) scalable Markov chain Monte Carlo (MCMC) sampling techniques that exploit gradients, e.g. Hamiltonian Monte Carlo; and (iv) decoupled and scalable Bayesian model selection techniques that compute the Bayesian evidence purely from posterior samples, e.g. the learned harmonic mean implemented in harmonic. This paradigm allows us to carry out a complete Bayesian analysis, including both parameter estimation and model selection, in a fraction of the time of traditional approaches. First, we demonstrate the application of this paradigm on a simulated cosmic shear analysis for a Stage IV survey in 37- and 39-dimensional parameter spaces, comparing $\Lambda$CDM and a dynamical dark energy model ($w_0w_a$CDM). We recover posterior contours and evidence estimates that are in excellent agreement with those computed by the traditional nested sampling approach while reducing the computational cost from 8 months on 48 CPU cores to 2 days on 12 GPUs. Second, we consider a joint analysis between three simulated next-generation surveys, each performing a 3x2pt analysis, resulting in 157- and 159-dimensional parameter spaces. Standard nested sampling techniques are simply unlikely to be feasible in this high-dimensional setting, requiring a projected 12 years of compute time on 48 CPU cores; on the other hand, the proposed approach only requires 8 days of compute time on 24 GPUs. All packages used in our analyses are publicly available.
comment: 14 pages, 6 figures. Accepted for publication in the Open Journal of Astrophysics. Codes available at https://github.com/alessiospuriomancini/cosmopower, https://github.com/dpiras/cosmopower-jax, https://github.com/astro-informatics/harmonic/
♻ ☆ A possible late-time transition of $M_B$ inferred via neural networks
The strengthening of tensions in the cosmological parameters has led to a reconsideration of fundamental aspects of standard cosmology. The tension in the Hubble constant can also be viewed as a tension between local and early Universe constraints on the absolute magnitude $M_B$ of Type Ia supernova. In this work, we reconsider the possibility of a variation of this parameter in a model-independent way. We employ neural networks to agnostically constrain the value of the absolute magnitude as well as assess the impact and statistical significance of a variation in $M_B$ with redshift from the Pantheon+ compilation, together with a thorough analysis of the neural network architecture. We find an indication for a possible transition redshift at the $z\approx 1$ region.
comment: 13 pages, 9 sets of figures, 2 tables. To appear in JCAP
♻ ☆ Variational Mode Decomposition and Linear Embeddings are What You Need For Time-Series Forecasting
Time-series forecasting often faces challenges due to data volatility, which can lead to inaccurate predictions. Variational Mode Decomposition (VMD) has emerged as a promising technique to mitigate volatility by decomposing data into distinct modes, thereby enhancing forecast accuracy. In this study, we integrate VMD with linear models to develop a robust forecasting framework. Our approach is evaluated on 13 diverse datasets, including ETTm2, WindTurbine, M4, and 10 air quality datasets from various Southeast Asian cities. The effectiveness of the VMD strategy is assessed by comparing Root Mean Squared Error (RMSE) values from models utilizing VMD against those without it. Additionally, we benchmark linear-based models against well-known neural network architectures such as LSTM, Bidirectional LSTM, and RNN. The results demonstrate a significant reduction in RMSE across nearly all models following VMD application. Notably, the Linear + VMD model achieved the lowest average RMSE in univariate forecasting at 0.619. In multivariate forecasting, the DLinear + VMD model consistently outperformed others, attaining the lowest RMSE across all datasets with an average of 0.019. These findings underscore the effectiveness of combining VMD with linear models for superior time-series forecasting.
comment: For associated repository, see https://github.com/Espalemit/VMD-With-LTSF-Linear.git
♻ ☆ GT-CausIn: a novel causal-based insight for traffic prediction
Traffic forecasting is an important application of spatiotemporal series prediction. Among different methods, graph neural networks have achieved so far the most promising results, learning relations between graph nodes then becomes a crucial task. However, improvement space is very limited when these relations are learned in a node-to-node manner. The challenge stems from (1) obscure temporal dependencies between different stations, (2) difficulties in defining variables beyond the node level, and (3) no ready-made method to validate the learned relations. To confront these challenges, we define legitimate traffic causal variables to discover the causal relation inside the traffic network, which is carefully checked with statistic tools and case analysis. We then present a novel model named Graph Spatial-Temporal Network Based on Causal Insight (GT-CausIn), where prior learned causal information is integrated with graph diffusion layers and temporal convolutional network (TCN) layers. Experiments are carried out on two real-world traffic datasets: PEMS-BAY and METR-LA, which show that GT-CausIn significantly outperforms the state-of-the-art models on mid-term and long-term prediction.
♻ ☆ When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective
Adapting pre-trained models to new tasks can exhibit varying effectiveness across datasets. Visual prompting, a state-of-the-art parameter-efficient transfer learning method, can significantly improve the performance of out-of-distribution tasks. On the other hand, linear probing, a standard transfer learning method, can sometimes become the best approach. We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. By employing the LLR score alongside resource-efficient visual prompts approximations, our cost-effective measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%. The source code is available at https://github.com/IBM/VP-LLR.
♻ ☆ Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing
The incorporation of advanced sensors and machine learning techniques has enabled modern manufacturing enterprises to perform data-driven classification-based anomaly detection based on the sensor data collected in manufacturing processes. However, one critical challenge is that newly presented defect category may manifest as the manufacturing process continues, resulting in monitoring performance deterioration of previously trained machine learning models. Hence, there is an increasing need for empowering machine learning models to learn continually. Among all continual learning methods, memory-based continual learning has the best performance but faces the constraints of data storage capacity. To address this issue, this paper develops a novel pseudo replay-based continual learning framework by integrating class incremental learning and oversampling-based data generation. Without storing all the data, the developed framework could generate high-quality data representing previous classes to train machine learning model incrementally when new category anomaly occurs. In addition, it could even enhance the monitoring performance since it also effectively improves the data quality. The effectiveness of the proposed framework is validated in three cases studies, which leverages supervised classification problem for anomaly detection. The experimental results show that the developed method is very promising in detecting novel anomaly while maintaining a good performance on the previous task and brings up more flexibility in model architecture.
♻ ☆ Navigating the Maize: Cyclic and conditional computational graphs for molecular simulation
Many computational chemistry and molecular simulation workflows can be expressed as graphs. This abstraction is useful to modularize and potentially reuse existing components, as well as provide parallelization and ease reproducibility. Existing tools represent the computation as a directed acyclic graph (DAG), thus allowing efficient execution by parallelization of concurrent branches. These systems can, however, generally not express cyclic and conditional workflows. We therefore developed Maize, a workflow manager for cyclic and conditional graphs based on the principles of flow-based programming. By running each node of the graph concurrently in separate processes and allowing communication at any time through dedicated inter-node channels, arbitrary graph structures can be executed. We demonstrate the effectiveness of the tool on a dynamic active learning task in computational drug design, involving the use of a small molecule generative model and an associated scoring system, and on a reactivity prediction pipeline using quantum-chemistry and semiempirical approaches.
♻ ☆ DNN-GDITD: Out-of-distribution detection via Deep Neural Network based Gaussian Descriptor for Imbalanced Tabular Data
Classification tasks present challenges due to class imbalances and evolving data distributions. Addressing these issues requires a robust method to handle imbalances while effectively detecting out-of-distribution (OOD) samples not encountered during training. This study introduces a novel OOD detection algorithm designed for tabular datasets, titled Deep Neural Network-based Gaussian Descriptor for Imbalanced Tabular Data (DNN-GDITD). The DNN-GDITD algorithm can be placed on top of any DNN to facilitate better classification of imbalanced data and OOD detection using spherical decision boundaries. Using a combination of Push, Score-based, and focal losses, DNN-GDITD assigns confidence scores to test data points, categorizing them as known classes or as an OOD sample. Extensive experimentation on tabular datasets demonstrates the effectiveness of DNN-GDITD compared to three OOD algorithms. Evaluation encompasses imbalanced and balanced scenarios on diverse tabular datasets, including a synthetic financial dispute dataset and publicly available tabular datasets like Gas Sensor, Drive Diagnosis, and MNIST, showcasing DNN-GDITD's versatility.
comment: 17 pages
♻ ☆ MMA-MRNNet: Harnessing Multiple Models of Affect and Dynamic Masked RNN for Precise Facial Expression Intensity Estimation
This paper presents MMA-MRNNet, a novel deep learning architecture for dynamic multi-output Facial Expression Intensity Estimation (FEIE) from video data. Traditional approaches to this task often rely on complex 3-D CNNs, which require extensive pre-training and assume that facial expressions are uniformly distributed across all frames of a video. These methods struggle to handle videos of varying lengths, often resorting to ad-hoc strategies that either discard valuable information or introduce bias. MMA-MRNNet addresses these challenges through a two-stage process. First, the Multiple Models of Affect (MMA) extractor component is a Multi-Task Learning CNN that concurrently estimates valence-arousal, recognizes basic facial expressions, and detects action units in each frame. These representations are then processed by a Masked RNN component, which captures temporal dependencies and dynamically updates weights according to the true length of the input video, ensuring that only the most relevant features are used for the final prediction. The proposed unimodal non-ensemble learning MMA-MRNNet was evaluated on the Hume-Reaction dataset and demonstrated significantly superior performance, surpassing state-of-the-art methods by a wide margin, regardless of whether they were unimodal, multimodal, or ensemble approaches. Finally, we demonstrated the effectiveness of the MMA component of our proposed method across multiple in-the-wild datasets, where it consistently outperformed all state-of-the-art methods across various metrics.
♻ ☆ What Formal Languages Can Transformers Express? A Survey
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.
comment: One minor correction in {\S}5.1
♻ ☆ Decision-Focused Learning: Foundations, State of the Art, Benchmark and Future Opportunities
Decision-focused learning (DFL) is an emerging paradigm that integrates machine learning (ML) and constrained optimization to enhance decision quality by training ML models in an end-to-end system. This approach shows significant potential to revolutionize combinatorial decision-making in real-world applications that operate under uncertainty, where estimating unknown parameters within decision models is a major challenge. This paper presents a comprehensive review of DFL, providing an in-depth analysis of both gradient-based and gradient-free techniques used to combine ML and constrained optimization. It evaluates the strengths and limitations of these techniques and includes an extensive empirical evaluation of eleven methods across seven problems. The survey also offers insights into recent advancements and future research directions in DFL. Code and benchmark: https://github.com/PredOpt/predopt-benchmarks
comment: Experimental Survey and Benchmarking
♻ ☆ Can Vehicle Motion Planning Generalize to Realistic Long-tail Scenarios?
Real-world autonomous driving systems must make safe decisions in the face of rare and diverse traffic scenarios. Current state-of-the-art planners are mostly evaluated on real-world datasets like nuScenes (open-loop) or nuPlan (closed-loop). In particular, nuPlan seems to be an expressive evaluation method since it is based on real-world data and closed-loop, yet it mostly covers basic driving scenarios. This makes it difficult to judge a planner's capabilities to generalize to rarely-seen situations. Therefore, we propose a novel closed-loop benchmark interPlan containing several edge cases and challenging driving scenarios. We assess existing state-of-the-art planners on our benchmark and show that neither rule-based nor learning-based planners can safely navigate the interPlan scenarios. A recently evolving direction is the usage of foundation models like large language models (LLM) to handle generalization. We evaluate an LLM-only planner and introduce a novel hybrid planner that combines an LLM-based behavior planner with a rule-based motion planner that achieves state-of-the-art performance on our benchmark.
♻ ☆ In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data
Studying and analyzing cropland is a difficult task due to its dynamic and heterogeneous growth behavior. Usually, diverse data sources can be collected for its estimation. Although deep learning models have proven to excel in the crop classification task, they face substantial challenges when dealing with multiple inputs, named Multi-View Learning (MVL). The methods used in the MVL scenario can be structured based on the encoder architecture, the fusion strategy, and the optimization technique. The literature has primarily focused on using specific encoder architectures for local regions, lacking a deeper exploration of other components in the MVL methodology. In contrast, we investigate the simultaneous selection of the fusion strategy and encoder architecture, assessing global-scale cropland and crop-type classifications. We use a range of five fusion strategies (Input, Feature, Decision, Ensemble, Hybrid) and five temporal encoders (LSTM, GRU, TempCNN, TAE, L-TAE) as possible configurations in the MVL method. We use the CropHarvest dataset for validation, which provides optical, radar, weather time series, and topographic information as input data. We found that in scenarios with a limited number of labeled samples, a unique configuration is insufficient for all the cases. Instead, a specialized combination should be meticulously sought, including an encoder and fusion strategy. To streamline this search process, we suggest identifying the optimal encoder architecture tailored for a particular fusion strategy, and then determining the most suitable fusion strategy for the classification task. We provide a methodological framework for researchers exploring crop classification through an MVL methodology.
comment: submitted to journal
♻ ☆ Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation ACL
Multi-sensor ML models for EO aim to enhance prediction accuracy by integrating data from various sources. However, the presence of missing data poses a significant challenge, particularly in non-persistent sensors that can be affected by external factors. Existing literature has explored strategies like temporal dropout and sensor-invariant models to address the generalization to missing data issues. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. Particularly, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
comment: Accepted at the MACLEAN workshop in the ECML/PKDD 2024
♻ ☆ Scalable Glacier Mapping using Deep Learning and Open Earth Observation Data Matches the Accuracy of Manual Delineation
Accurate global glacier mapping is critical for understanding climate change impacts. Despite its importance, automated glacier mapping at a global scale remains largely unexplored. Here we address this gap and propose Glacier-VisionTransformer-U-Net (GlaViTU), a convolutional-transformer deep learning model, and five strategies for multitemporal global-scale glacier mapping using open satellite imagery. Assessing the spatial, temporal and cross-sensor generalisation shows that our best strategy achieves intersection over union >0.85 on previously unobserved images in most cases, which drops to >0.75 for debris-rich areas such as High-Mountain Asia and increases to >0.90 for regions dominated by clean ice. A comparative validation against human expert uncertainties in terms of area and distance deviations underscores GlaViTU performance, approaching or matching expert-level delineation. Adding synthetic aperture radar data, namely, backscatter and interferometric coherence, increases the accuracy in all regions where available. The calibrated confidence for glacier extents is reported making the predictions more reliable and interpretable. We also release a benchmark dataset that covers 9% of glaciers worldwide. Our results support efforts towards automated multitemporal and global glacier mapping.
comment: after major revision, expanded validation
♻ ☆ Smart E-commerce Recommendations with Semantic AI
In e-commerce, web mining for page recommendations is widely used but often fails to meet user needs. To address this, we propose a novel solution combining semantic web mining with BP neural networks. We process user search logs to extract five key features: content priority, time spent, user feedback, recommendation semantics, and input deviation. These features are then fed into a BP neural network to classify and prioritize web pages. The prioritized pages are recommended to users. Using book sales pages for testing, our results demonstrate that this solution can quickly and accurately identify the pages users need. Our approach ensures that recommendations are more relevant and tailored to individual preferences, enhancing the online shopping experience. By leveraging advanced semantic analysis and neural network techniques, we bridge the gap between user expectations and actual recommendations. This innovative method not only improves accuracy but also speeds up the recommendation process, making it a valuable tool for e-commerce platforms aiming to boost user satisfaction and engagement. Additionally, our system ability to handle large datasets and provide real-time recommendations makes it a scalable and efficient solution for modern e-commerce challenges.
comment: 8 pages
♻ ☆ A Hybrid Framework for Spatial Interpolation: Merging Data-driven with Domain Knowledge
Estimating spatially distributed information through the interpolation of scattered observation datasets often overlooks the critical role of domain knowledge in understanding spatial dependencies. Additionally, the features of these data sets are typically limited to the spatial coordinates of the scattered observation locations. In this paper, we propose a hybrid framework that integrates data-driven spatial dependency feature extraction with rule-assisted spatial dependency function mapping to augment domain knowledge. We demonstrate the superior performance of our framework in two comparative application scenarios, highlighting its ability to capture more localized spatial features in the reconstructed distribution fields. Furthermore, we underscore its potential to enhance nonlinear estimation capabilities through the application of transformed fuzzy rules and to quantify the inherent uncertainties associated with the observation data sets. Our framework introduces an innovative approach to spatial information estimation by synergistically combining observational data with rule-assisted domain knowledge.
comment: 21 pages, 13 figures; typos corrected, references updated
♻ ☆ Open Implementation and Study of BEST-RQ for Speech Processing ICASSP 2024
Self-Supervised Learning (SSL) has proven to be useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ), is an SSL method that has shown great performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods, such as wav2vec 2.0. Despite BEST-RQ's great performance, details are lacking in the original paper, such as the amount of GPU/TPU hours used in pre-training, and there is no official easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on other downstream tasks aside from ASR and speech translation. In this work, we describe a re-implementation of a Random-projection quantizer and perform a preliminary study with a comparison to wav2vec 2.0 on four downstream tasks. We discuss the details and differences of our implementation. We show that a random projection quantizer can achieve similar downstream performance as wav2vec 2.0 while decreasing training time by over a factor of two.
comment: Accepted in IEEE ICASSP 2024 workshop on Self-supervision in Audio, Speech and Beyond (SASB 2024)
♻ ☆ Moderate Adaptive Linear Units (MoLU)
We propose a new high-performance activation function, Moderate Adaptive Linear Units (MoLU), for the deep neural network. The MoLU is a simple, beautiful and powerful activation function that can be a good main activation function among hundreds of activation functions. Because the MoLU is made up of the elementary functions, not only it is a diffeomorphism (i.e. analytic over whole domains), but also it reduces the training time.
comment: 4 pages, 5 figures
♻ ☆ Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Large language models (LLMs) have triggered a new stream of research focusing on compressing the context length to reduce the computational cost while ensuring the retention of helpful information for LLMs to answer the given question. Token-based removal methods are one of the most prominent approaches in this direction, but risk losing the semantics of the context caused by intermediate token removal, especially under high compression ratios, while also facing challenges in computational efficiency. In this work, we propose context-aware prompt compression (CPC), a sentence-level prompt compression technique where its key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question. To train this encoder, we generate a new dataset consisting of questions, positives, and negative pairs where positives are sentences relevant to the question, while negatives are irrelevant context sentences. We train the encoder in a contrastive setup to learn context-aware sentence representations. Our method considerably outperforms prior works on prompt compression on benchmark datasets and is up to 10.93x faster at inference compared to the best token-level compression method. We also find better improvement for shorter length constraints in most benchmarks, showing the effectiveness of our proposed solution in the compression of relevant information in a shorter context. Finally, we release the code and the dataset for quick reproducibility and further development: https://github.com/Workday/cpc.
♻ ☆ Simultaneous Training of First- and Second-Order Optimizers in Population-Based Reinforcement Learning
The tuning of hyperparameters in reinforcement learning (RL) is critical, as these parameters significantly impact an agent's performance and learning efficiency. Dynamic adjustment of hyperparameters during the training process can significantly enhance both the performance and stability of learning. Population-based training (PBT) provides a method to achieve this by continuously tuning hyperparameters throughout the training. This ongoing adjustment enables models to adapt to different learning stages, resulting in faster convergence and overall improved performance. In this paper, we propose an enhancement to PBT by simultaneously utilizing both first- and second-order optimizers within a single population. We conducted a series of experiments using the TD3 algorithm across various MuJoCo environments. Our results, for the first time, empirically demonstrate the potential of incorporating second-order optimizers within PBT-based RL. Specifically, the combination of the K-FAC optimizer with Adam led to up to a 10% improvement in overall performance compared to PBT using only Adam. Additionally, in environments where Adam occasionally fails, such as the Swimmer environment, the mixed population with K-FAC exhibited more reliable learning outcomes, offering a significant advantage in training stability without a substantial increase in computational time.
comment: 8 pages, 5 figures
♻ ☆ SparQ Attention: Bandwidth-Efficient LLM Inference
The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token-generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.
♻ ☆ Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It's cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
comment: Journal Paper, 14 pages
♻ ☆ BiKC: Keypose-Conditioned Consistency Policy for Bimanual Robotic Manipulation
Bimanual manipulation tasks typically involve multiple stages which require efficient interactions between two arms, posing step-wise and stage-wise challenges for imitation learning systems. Specifically, failure and delay of one step will broadcast through time, hinder success and efficiency of each sub-stage task, and thereby overall task performance. Although recent works have made strides in addressing certain challenges, few approaches explicitly consider the multi-stage nature of bimanual tasks while simultaneously emphasizing the importance of inference speed. In this paper, we introduce a novel keypose-conditioned consistency policy tailored for bimanual manipulation. It is a hierarchical imitation learning framework that consists of a high-level keypose predictor and a low-level trajectory generator. The predicted keyposes provide guidance for trajectory generation and also mark the completion of one sub-stage task. The trajectory generator is designed as a consistency model trained from scratch without distillation, which generates action sequences conditioning on current observations and predicted keyposes with fast inference speed. Simulated and real-world experimental results demonstrate that the proposed approach surpasses baseline methods in terms of success rate and operational efficiency. Codes are available at https://github.com/ManUtdMoon/BiKC.
comment: Accepted by The 16th International Workshop on the Algorithmic Foundations of Robotics (WAFR 2024)
♻ ☆ NetMamba: Efficient Network Traffic Classification via Pre-training Unidirectional Mamba
Network traffic classification is a crucial research area aiming to enhance service quality, streamline network management, and bolster cybersecurity. To address the growing complexity of transmission encryption techniques, various machine learning and deep learning methods have been proposed. However, existing approaches face two main challenges. Firstly, they struggle with model inefficiency due to the quadratic complexity of the widely used Transformer architecture. Secondly, they suffer from inadequate traffic representation because of discarding important byte information while retaining unwanted biases. To address these challenges, we propose NetMamba, an efficient linear-time state space model equipped with a comprehensive traffic representation scheme. We adopt a specially selected and improved unidirectional Mamba architecture for the networking field, instead of the Transformer, to address efficiency issues. In addition, we design a traffic representation scheme to extract valid information from massive traffic data while removing biased information. Evaluation experiments on six public datasets encompassing three main classification tasks showcase NetMamba's superior classification performance compared to state-of-the-art baselines. It achieves an accuracy rate of nearly 99% (some over 99%) in all tasks. Additionally, NetMamba demonstrates excellent efficiency, improving inference speed by up to 60 times while maintaining comparably low memory usage. Furthermore, NetMamba exhibits superior few-shot learning abilities, achieving better classification performance with fewer labeled data. To the best of our knowledge, NetMamba is the first model to tailor the Mamba architecture for networking.
♻ ☆ From Categories to Classifiers: Name-Only Continual Learning by Exploring the Web
Continual Learning (CL) often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice. We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation. In this scenario, learners adapt to new category shifts using only category names without the luxury of annotated training data. Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification. We investigate the reliability of our web data and find them comparable, and in some cases superior, to manually annotated datasets. Additionally, we show that by harnessing the web, we can create support sets that surpass state-of-the-art name-only classification that create support sets using generative models or image retrieval from LAION-5B, achieving up to 25% boost in accuracy. When applied across varied continual learning contexts, our method consistently exhibits a small performance gap in comparison to models trained on manually annotated datasets. We present EvoTrends, a class-incremental dataset made from the web to capture real-world trends, created in just minutes. Overall, this paper underscores the potential of using uncurated webly-supervised data to mitigate the challenges associated with manual data labeling in continual learning.
♻ ☆ A Systematic Review on Sleep Stage Classification and Sleep Disorder Detection Using Artificial Intelligence
Sleep is vital for people's physical and mental health, and sound sleep can help them focus on daily activities. Therefore, a sleep study that includes sleep patterns and sleep disorders is crucial to enhancing our knowledge about individuals' health status. This study aims to provide a comprehensive, systematic review of the recent literature to analyze the different approaches and their outcomes in sleep studies, which includes works on "sleep stages classification" and "sleep disorder detection" using AI. In this review, 183 articles were initially selected from different journals, among which 80 records were enlisted for explicit review, ranging from 2016 to 2023. Brain waves were the most commonly employed body parameters for sleep staging and disorder studies (almost 29% of the research used brain activity signals exclusively, and 77% combined with the other signals). The convolutional neural network (CNN), the most widely used of the 34 distinct artificial intelligence models, comprised 27%. The other models included the long short-term memory (LSTM), support vector machine (SVM), random forest (RF), and recurrent neural network (RNN), which consisted of 11%, 6%, 6%, and 5% sequentially. For performance metrics, accuracy was widely used for a maximum of 83.75% of the cases, the F1 score of 45%, Kappa of 36.25%, Sensitivity of 31.25%, and Specificity of 30% of cases, along with the other metrics. This article would help physicians and researchers get the gist of AI's contribution to sleep studies and the feasibility of their intended work.
comment: 39 pages, 11 Figures, 8 Tables
♻ ☆ The Fault in our Stars: Quality Assessment of Code Generation Benchmarks SC
Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can provide a false sense of performance. In this work, we conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models. To conduct this study, we analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues in them. We also investigated whether fixing the identified quality issues in the benchmarks' prompts affects a model's performance. We also studied memorization issues of the evaluation dataset, which can put into question a benchmark's trustworthiness. We found that code generation evaluation benchmarks mainly focused on Python and coding exercises and had very limited contextual dependencies to challenge the model. These datasets and the developers' prompts suffer from quality issues like spelling and grammatical errors, unclear sentences to express developers' intent, and not using proper documentation style. Fixing all these issues in the benchmarks can lead to a better performance for Python code generation, but not a significant improvement was observed for Java code generation. We also found evidence that GPT-3.5-Turbo and CodeGen-2.5 models may have data contamination issues.
comment: Accepted at the 24th IEEE International Conference on Source Code Analysis and Manipulation(SCAM 2024) Research Track
♻ ☆ Sample Complexity of Variance-reduced Distributionally Robust Q-learning
Dynamic decision-making under distributional shifts is of fundamental interest in theory and applications of reinforcement learning: The distribution of the environment in which the data is collected can differ from that of the environment in which the model is deployed. This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. These algorithms are designed to efficiently approximate the $q$-function of an infinite-horizon $\gamma$-discounted robust Markov decision process with Kullback-Leibler ambiguity set to an entry-wise $\epsilon$-degree of precision. Further, the variance-reduced distributionally robust Q-learning combines the synchronous Q-learning with variance-reduction techniques to enhance its performance. Consequently, we establish that it attains a minimax sample complexity upper bound of $\tilde O(|\mathbf{S}||\mathbf{A}|(1-\gamma)^{-4}\epsilon^{-2})$, where $\mathbf{S}$ and $\mathbf{A}$ denote the state and action spaces. This is the first complexity result that is independent of the ambiguity size $\delta$, thereby providing new complexity theoretic insights. Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.
♻ ☆ Predicting and Interpreting Energy Barriers of Metallic Glasses with Graph Neural Networks ICML 2024
Metallic Glasses (MGs) are widely used materials that are stronger than steel while being shapeable as plastic. While understanding the structure-property relationship of MGs remains a challenge in materials science, studying their energy barriers (EBs) as an intermediary step shows promise. In this work, we utilize Graph Neural Networks (GNNs) to model MGs and study EBs. We contribute a new dataset for EB prediction and a novel Symmetrized GNN (SymGNN) model that is E(3)-invariant in expectation. SymGNN handles invariance by aggregating over orthogonal transformations of the graph structure. When applied to EB prediction, SymGNN are more accurate than molecular dynamics (MD) local-sampling methods and other machine-learning models. Compared to precise MD simulations, SymGNN reduces the inference time on new MGs from roughly 41 days to less than one second. We apply explanation algorithms to reveal the relationship between structures and EBs. The structures that we identify through explanations match the medium-range order (MRO) hypothesis and possess unique topological properties. Our work enables effective prediction and interpretation of MG EBs, bolstering material science research.
comment: ICML 2024. Code available at https://github.com/haoyuli02/SymGNN
♻ ☆ Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data
Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in Cloud-based machine learning solutions, while preserving data privacy. Unfortunately, the learning performance of FEEL may be compromised due to limited training data in a single edge cluster. In this paper, we investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL). By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency, while improving the learning performance by accessing richer training data from multiple edge clusters. A training algorithm for SD-FEEL with three main procedures in each round is presented, including local model updates, intra-cluster and inter-cluster model aggregations, which is proved to converge on non-independent and identically distributed (non-IID) data. We also characterize the interplay between the network topology of the edge servers and the communication overhead of inter-cluster model aggregation on the training performance. Experiment results corroborate our analysis and demonstrate the effectiveness of SD-FFEL in achieving faster convergence than traditional federated learning architectures. Besides, guidelines on choosing critical hyper-parameters of the training algorithm are also provided.
♻ ☆ EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification
Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely EnsLoss, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is the consideration on preserving the "legitimacy" of the combined losses, i.e., ensuring the CC properties. Specifically, we first transform the CC conditions of losses into loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Therefore, inspired by Dropout, EnsLoss enables loss ensembles through one training process with doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of EnsLoss compared to fixed loss methods is demonstrated through experiments on a broad range of 14 OpenML tabular datasets and 46 image datasets with various deep learning architectures. Python repository and source code are available on GitHub at https://github.com/statmlben/ensloss.
comment: 31 pages; 4 figures
♻ ☆ A Confidence Interval for the $\ell_2$ Expected Calibration Error
Recent advances in machine learning have significantly improved prediction accuracy in various applications. However, ensuring the calibration of probabilistic predictions remains a significant challenge. Despite efforts to enhance model calibration, the rigorous statistical evaluation of model calibration remains less explored. In this work, we develop confidence intervals the $\ell_2$ Expected Calibration Error (ECE). We consider top-1-to-$k$ calibration, which includes both the popular notion of confidence calibration as well as full calibration. For a debiased estimator of the ECE, we show asymptotic normality, but with different convergence rates and asymptotic variances for calibrated and miscalibrated models. We develop methods to construct asymptotically valid confidence intervals for the ECE, accounting for this behavior as well as non-negativity. Our theoretical findings are supported through extensive experiments, showing that our methods produce valid confidence intervals with shorter lengths compared to those obtained by resampling-based methods.
♻ ☆ A Novel Approach to Classify Power Quality Signals Using Vision Transformers
With the rapid integration of electronically interfaced renewable energy resources and loads into smart grids, there is increasing interest in power quality disturbances (PQD) classification to enhance the security and efficiency of these grids. This paper introduces a new approach to PQD classification based on the Vision Transformer (ViT) model. When a PQD occurs, the proposed approach first converts the power quality signal into an image and then utilizes a pre-trained ViT to accurately determine the class of the PQD. Unlike most previous works, which were limited to a few disturbance classes or small datasets, the proposed method is trained and tested on a large dataset with 17 disturbance classes. Our experimental results show that the proposed ViT-based approach achieves PQD classification precision and recall of 98.28% and 97.98%, respectively, outperforming recently proposed techniques applied to the same dataset.
comment: IECON 2024-50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, U.S.A, 2024, pp. 1-6
♻ ☆ Graph-Based Bidirectional Transformer Decision Threshold Adjustment Algorithm for Class-Imbalanced Molecular Data
Data sets with imbalanced class sizes, where one class size is much smaller than that of others, occur exceedingly often in many applications, including those with biological foundations, such as disease diagnosis and drug discovery. Therefore, it is extremely important to be able to identify data elements of classes of various sizes, as a failure to do so can result in heavy costs. Nonetheless, many data classification procedures do not perform well on imbalanced data sets as they often fail to detect elements belonging to underrepresented classes. In this work, we propose the BTDT-MBO algorithm, incorporating Merriman-Bence-Osher (MBO) approaches and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification tasks on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed technique not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer procedure based on an attention mechanism for self-supervised learning. In addition, the model implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed method is validated using six molecular data sets and compared to other related techniques. The computational experiments show that the proposed technique is superior to competing approaches even in the case of a high class imbalance ratio.
♻ ☆ CCPL: Cross-modal Contrastive Protein Learning ICPR 2024
Effective protein representation learning is crucial for predicting protein functions. Traditional methods often pretrain protein language models on large, unlabeled amino acid sequences, followed by finetuning on labeled data. While effective, these methods underutilize the potential of protein structures, which are vital for function determination. Common structural representation techniques rely heavily on annotated data, limiting their generalizability. Moreover, structural pretraining methods, similar to natural language pretraining, can distort actual protein structures. In this work, we introduce a novel unsupervised protein structure representation pretraining method, cross-modal contrastive protein learning (CCPL). CCPL leverages a robust protein language model and uses unsupervised contrastive alignment to enhance structure learning, incorporating self-supervised structural constraints to maintain intrinsic structural information. We evaluated our model across various benchmarks, demonstrating the framework's superiority.
comment: Accepted to ICPR 2024
♻ ☆ Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning ECCV 2024
Federated Class Continual Learning (FCCL) merges the challenges of distributed client learning with the need for seamless adaptation to new classes without forgetting old ones. The key challenge in FCCL is catastrophic forgetting, an issue that has been explored to some extent in Continual Learning (CL). However, due to privacy preservation requirements, some conventional methods, such as experience replay, are not directly applicable to FCCL. Existing FCCL methods mitigate forgetting by generating historical data through federated training of GANs or data-free knowledge distillation. However, these approaches often suffer from unstable training of generators or low-quality generated data, limiting their guidance for the model. To address this challenge, we propose a novel method of data replay based on diffusion models. Instead of training a diffusion model, we employ a pre-trained conditional diffusion model to reverse-engineer each class, searching the corresponding input conditions for each class within the model's input space, significantly reducing computational resources and time consumption while ensuring effective generation. Furthermore, we enhance the classifier's domain generalization ability on generated and real data through contrastive learning, indirectly improving the representational capability of generated data for real data. Comprehensive experiments demonstrate that our method significantly outperforms existing baselines. Code is available at https://github.com/jinglin-liang/DDDR.
comment: Accepted by ECCV 2024 Oral
♻ ☆ Small noise analysis for Tikhonov and RKHS regularizations
Regularization plays a pivotal role in ill-posed machine learning and inverse problems. However, the fundamental comparative analysis of various regularization norms remains open. We establish a small noise analysis framework to assess the effects of norms in Tikhonov and RKHS regularizations, in the context of ill-posed linear inverse problems with Gaussian noise. This framework studies the convergence rates of regularized estimators in the small noise limit and reveals the potential instability of the conventional L2-regularizer. We solve such instability by proposing an innovative class of adaptive fractional RKHS regularizers, which covers the L2 Tikhonov and RKHS regularizations by adjusting the fractional smoothness parameter. A surprising insight is that over-smoothing via these fractional RKHSs consistently yields optimal convergence rates, but the optimal hyper-parameter may decay too fast to be selected in practice.
♻ ☆ Stacked ensemble\-based mutagenicity prediction model using multiple modalities with graph attention network
Mutagenicity is a concern due to its association with genetic mutations which can result in a variety of negative consequences, including the development of cancer. Earlier identification of mutagenic compounds in the drug development process is therefore crucial for preventing the progression of unsafe candidates and reducing development costs. While computational techniques, especially machine learning models have become increasingly prevalent for this endpoint, they rely on a single modality. In this work, we introduce a novel stacked ensemble based mutagenicity prediction model which incorporate multiple modalities such as simplified molecular input line entry system (SMILES) and molecular graph. These modalities capture diverse information about molecules such as substructural, physicochemical, geometrical and topological. To derive substructural, geometrical and physicochemical information, we use SMILES, while topological information is extracted through a graph attention network (GAT) via molecular graph. Our model uses a stacked ensemble of machine learning classifiers to make predictions using these multiple features. We employ the explainable artificial intelligence (XAI) technique SHAP (Shapley Additive Explanations) to determine the significance of each classifier and the most relevant features in the prediction. We demonstrate that our method surpasses SOTA methods on two standard datasets across various metrics. Notably, we achieve an area under the curve of 95.21\% on the Hansen benchmark dataset, affirming the efficacy of our method in predicting mutagenicity. We believe that this research will captivate the interest of both clinicians and computational biologists engaged in translational research.
comment: Submitted to a journal
♻ ☆ OpenVLA: An Open-Source Vision-Language-Action Model
Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
comment: Website: https://openvla.github.io/
♻ ☆ SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously-generated alternatives, than generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
♻ ☆ Thresholded Lexicographic Ordered Multiobjective Reinforcement Learning ECAI 2024
Lexicographic multi-objective problems, which impose a lexicographic importance order over the objectives, arise in many real-life scenarios. Existing Reinforcement Learning work directly addressing lexicographic tasks has been scarce. The few proposed approaches were all noted to be heuristics without theoretical guarantees as the Bellman equation is not applicable to them. Additionally, the practical applicability of these prior approaches also suffers from various issues such as not being able to reach the goal state. While some of these issues have been known before, in this work we investigate further shortcomings, and propose fixes for improving practical performance in many cases. We also present a policy optimization approach using our Lexicographic Projection Optimization (LPO) algorithm that has the potential to address these theoretical and practical concerns. Finally, we demonstrate our proposed algorithms on benchmark problems.
comment: Full version of ECAI 2024 paper
♻ ☆ LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
♻ ☆ Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.
♻ ☆ From Lab to Field: Real-World Evaluation of an AI-Driven Smart Video Solution to Enhance Community Safety
This article adopts and evaluates an AI-enabled Smart Video Solution (SVS) designed to enhance safety in the real world. The system integrates with existing infrastructure camera networks, leveraging recent advancements in AI for easy adoption. Prioritizing privacy and ethical standards, pose based data is used for downstream AI tasks such as anomaly detection. Cloud-based infrastructure and mobile app are deployed, enabling real-time alerts within communities. The SVS employs innovative data representation and visualization techniques, such as the Occupancy Indicator, Statistical Anomaly Detection, Bird's Eye View, and Heatmaps, to understand pedestrian behaviors and enhance public safety. Evaluation of the SVS demonstrates its capacity to convert complex computer vision outputs into actionable insights for stakeholders, community partners, law enforcement, urban planners, and social scientists. This article presents a comprehensive real-world deployment and evaluation of the SVS, implemented in a community college environment across 16 cameras. The system integrates AI-driven visual processing, supported by statistical analysis, database management, cloud communication, and user notifications. Additionally, the article evaluates the end-to-end latency from the moment an AI algorithm detects anomalous behavior in real-time at the camera level to the time stakeholders receive a notification. The results demonstrate the system's robustness, effectively managing 16 CCTV cameras with a consistent throughput of 16.5 frames per second (FPS) over a 21-hour period and an average end-to-end latency of 26.76 seconds between anomaly detection and alert issuance.
Multimedia
☆ LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as \textit{degraded performance with more images} and \textit{high computational costs}. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model \textbf{LongLLaVA}~(\textbf{Long}-Context \textbf{L}arge \textbf{L}anguage \textbf{a}nd \textbf{V}ision \textbf{A}ssistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
comment: 19 pages, 7 figures, 6 tables
☆ Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model
Diffusion models have shown promising results in cross-modal generation tasks involving audio and music, such as text-to-sound and text-to-music generation. These text-controlled music generation models typically focus on generating music by capturing global musical attributes like genre and mood. However, music composition is a complex, multilayered task that often involves musical arrangement as an integral part of the process. This process involves composing each instrument to align with existing ones in terms of beat, dynamics, harmony, and melody, requiring greater precision and control over tracks than text prompts usually provide. In this work, we address these challenges by extending the MusicLDM, a latent diffusion model for music, into a multi-track generative model. By learning the joint probability of tracks sharing a context, our model is capable of generating music across several tracks that correspond well to each other, either conditionally or unconditionally. Additionally, our model is capable of arrangement generation, where the model can generate any subset of tracks given the others (e.g., generating a piano track complementing given bass and drum tracks). We compared our model with an existing multi-track generative model and demonstrated that our model achieves considerable improvements across objective metrics for both total and arrangement generation tasks.
☆ ExpLLM: Towards Chain of Thought for Facial Expression Recognition
Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.
comment: project page: https://starhiking.github.io/ExpLLM_Page/
☆ PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation
While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose \textbf{PoseTalk}, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latent from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4\% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low-to-high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate our pose prediction strategy achieves better pose diversity and realness compared to text-only or audio-only, and our video generator model outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.
comment: 7+5 pages, 15 figures
☆ Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation
Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
comment: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
☆ FrameCorr: Adaptive, Autoencoder-based Neural Compression for Video Reconstruction in Resource and Timing Constrained Network Settings
Despite the growing adoption of video processing via Internet of Things (IoT) devices due to their cost-effectiveness, transmitting captured data to nearby servers poses challenges due to varying timing constraints and scarcity of network bandwidth. Existing video compression methods face difficulties in recovering compressed data when incomplete data is provided. Here, we introduce \emph{\project}, a deep-learning based solution that utilizes previously received data to predict the missing segments of a frame, enabling the reconstruction of a frame from partially received data.
☆ Coral Model Generation from Single Images for Virtual Reality Applications
With the rapid development of VR technology, the demand for high-quality 3D models is increasing. Traditional methods struggle with efficiency and quality in large-scale customization. This paper introduces a deep-learning framework that generates high-precision 3D coral models from a single image. Using the Coral dataset, the framework extracts geometric and texture features, performs 3D reconstruction, and optimizes design and material blending. Advanced optimization and polygon count control ensure shape accuracy, detail retention, and flexible output for various complexities, catering to high-quality rendering and real-time interaction needs.The project incorporates Explainable AI (XAI) to transform AI-generated models into interactive "artworks," best viewed in VR and XR. This enhances model interpretability and human-machine collaboration. Real-time feedback in VR interactions displays information like coral species and habitat, enriching user experience. The generated models surpass traditional methods in detail, visual quality, and efficiency. This research offers an intelligent approach to 3D content creation for VR, lowering production barriers, and promoting widespread VR applications. Additionally, integrating XAI provides new insights into AI-generated visual content and advances research in 3D vision interpretability.
comment: In Proceedings of Explainable AI for the Arts Workshop 2024 (XAIxArts 2024) arXiv:2406.14485
♻ ☆ Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.
comment: Project page https://haozhuo-zhang.github.io/Hand1000-project-page/
♻ ☆ MCDubber: Multimodal Context-Aware Expressive Video Dubbing SC2024
Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed \textbf{MCDubber}, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
comment: Accepted by NCMMSC2024
Computation and Language
☆ Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
☆ Optimal L-Systems for Stochastic L-system Inference Problems
This paper presents two novel theorems that address two open problems in stochastic Lindenmayer-system (L-system) inference, specifically focusing on the construction of an optimal stochastic L-system capable of generating a given sequence of strings. The first theorem delineates a method for crafting a stochastic L-system that maximizes the likelihood of producing a given sequence of words through a singular derivation. Furthermore, the second theorem determines the stochastic L-systems with the highest probability of producing a given sequence of words with multiple possible derivations. From these, we introduce an algorithm to infer an optimal stochastic L-system from a given sequence. This algorithm incorporates sophisticated optimization techniques, such as interior point methods, ensuring production of a stochastically optimal stochastic L-system suitable for generating the given sequence. This allows for the use of using stochastic L-systems as model for machine learning using only positive data for training.
☆ MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of model discrimination, particularly in multi-correct answer scenarios. We introduce novel metrics like shortcut selection ratio and correct pair identification ratio, offering deeper insights into model behavior and anchoring bias. Evaluations of five state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning abilities and bias susceptibility. We release the dataset and evaluation codes at \url{https://github.com/asgsaeid/mmlu-pro-plus}.
☆ Therapy as an NLP Task: Psychologists' Comparison of LLMs and Human Peers in CBT
Wider access to therapeutic care is one of the biggest challenges in mental health treatment. Due to institutional barriers, some people seeking mental health support have turned to large language models (LLMs) for personalized therapy, even though these models are largely unsanctioned and untested. We investigate the potential and limitations of using LLMs as providers of evidence-based therapy by using mixed methods clinical metrics. Using HELPERT, a prompt run on a large language model using the same process and training as a comparative group of peer counselors, we replicated publicly accessible mental health conversations rooted in Cognitive Behavioral Therapy (CBT) to compare session dynamics and counselor's CBT-based behaviors between original peer support sessions and their reconstructed HELPERT sessions. Two licensed, CBT-trained clinical psychologists evaluated the sessions using the Cognitive Therapy Rating Scale and provided qualitative feedback. Our findings show that the peer sessions are characterized by empathy, small talk, therapeutic alliance, and shared experiences but often exhibit therapist drift. Conversely, HELPERT reconstructed sessions exhibit minimal therapist drift and higher adherence to CBT methods but display a lack of collaboration, empathy, and cultural understanding. Through CTRS ratings and psychologists' feedback, we highlight the importance of human-AI collaboration for scalable mental health. Our work outlines the ethical implication of imparting human-like subjective qualities to LLMs in therapeutic settings, particularly the risk of deceptive empathy, which may lead to unrealistic patient expectations and potential harm.
☆ Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
comment: Accepted to IEEE SLT 2024
☆ Unforgettable Generalization in Language Models
When language models (LMs) are trained to forget (or "unlearn'') a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training'' set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly, and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering) forgetting affects only the training examples, and models continue to perform the "forgotten'' task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.
comment: 18 pages, 9 figures, published in First Conference on Language Modeling 2024
☆ Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling
This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we found that multilingualism does not affect the bias in this VGS model similarly to what is observed in children.
comment: PhD Dissertation
☆ CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
☆ Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text
Social scientists quickly adopted large language models due to their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, due to their compute demands, cost, and often proprietary nature, these models are often at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good, or better than, state-of-the art large language models at zero and few-shot classification, but are orders of magnitude more efficient and completely open source. By training the models on a simple random sample of 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents and state-of-the-art generative models with complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models -- a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
comment: 26 pages, 5 figures
☆ Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models
The abilities of long-context language models (LMs) are often evaluated using the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information ("needle") within large text sequences ("haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation--a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on the Spinning the Golden Thread, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
☆ OLMoE: Open Mixture-of-Experts Language Models
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
comment: 61 pages (24 main), 36 figures, 14 tables
☆ Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model
Due to the inherent difficulty in modeling phonetic similarities across different languages, code-switching speech recognition presents a formidable challenge. This study proposes a Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups. Initially, a preceding routing network explicitly learns Language Identification (LID) tasks and selects experts based on acquired LID weights. This process ensures robust routing information to the MoE layer, mitigating interference from diverse language domains on expert network parameter updates. The LID weights are also employed to facilitate inter-group collaboration, enabling the integration of language-specific representations. Furthermore, within each language expert group, a gating network operates unsupervised to foster collaboration on attributes beyond language. Extensive experiments demonstrate the efficacy of our approach, achieving significant performance enhancements compared to alternative methods. Importantly, our method preserves the efficient inference capabilities characteristic of MoE models without necessitating additional pre-training.
comment: Accepted to IEEE SLT 2024
☆ BEAVER: An Enterprise Benchmark for Text-to-SQL
Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the reasons for poor performance are largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the "dark web", (2) schemas of enterprise tables are more complex than the schemas in public data, which leads the SQL-generation task innately harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will facilitate future researchers building more sophisticated text-to-SQL systems which can do better on this important class of data.
☆ Foundations of Large Language Model Compression -- Part 1: Weight Quantization
In recent years, compression of large language models (LLMs) has emerged as an important problem to allow language model deployment on resource-constrained devices, reduce computational costs, and mitigate the environmental footprint of large-scale AI infrastructure. In this paper, we present the foundations of LLM quantization from a convex optimization perspective and propose a quantization method that builds on these foundations and outperforms previous methods. Our quantization framework, CVXQ, scales to models containing hundreds of billions of weight parameters and provides users with the flexibility to compress models to any specified model size, post-training. A reference implementation of CVXQ can be obtained from https://github.com/seannz/cvxq.
comment: Preprint
☆ FuzzCoder: Byte-level Fuzzing Test via Large Language Model
Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs in an efficient manner is a difficult open problem and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework to leverage the code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as the sequence-to-sequence modeling, where LLM receives a sequence of bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on the created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from the heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies locations in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder based on AFL (American Fuzzy Lop) gain significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.
comment: 11 pages
☆ Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation
This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q\&A) systems, a crucial form of Natural Language Processing. Traditionally, human evaluation has been indispensable for assessing the quality of these responses. However, manual evaluation by medical professionals is time-consuming and costly. Our study examines whether LLMs can reliably replicate human evaluations by using questions derived from patient data, thereby saving valuable time for medical experts. While the findings suggest promising results, further research is needed to address more specific or complex questions that were beyond the scope of this initial investigation.
comment: 10 pages, 3 figures, 3 tables
☆ 3D-LEX v1.0: 3D Lexicons for American Sign Language and Sign Language of the Netherlands
In this work, we present an efficient approach for capturing sign language in 3D, introduce the 3D-LEX v1.0 dataset, and detail a method for semi-automatic annotation of phonetic properties. Our procedure integrates three motion capture techniques encompassing high-resolution 3D poses, 3D handshapes, and depth-aware facial features, and attains an average sampling rate of one sign every 10 seconds. This includes the time for presenting a sign example, performing and recording the sign, and archiving the capture. The 3D-LEX dataset includes 1,000 signs from American Sign Language and an additional 1,000 signs from the Sign Language of the Netherlands. We showcase the dataset utility by presenting a simple method for generating handshape annotations directly from 3D-LEX. We produce handshape labels for 1,000 signs from American Sign Language and evaluate the labels in a sign recognition task. The labels enhance gloss recognition accuracy by 5% over using no handshape annotations, and by 1% over expert annotations. Our motion capture data supports in-depth analysis of sign features and facilitates the generation of 2D projections from any viewpoint. The 3D-LEX collection has been aligned with existing sign language benchmarks and linguistic resources, to support studies in 3D-aware sign language processing.
☆ What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: https://github.com/WowCZ/LongMIT.
comment: Work in progress
☆ Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis
This study explores the potential of large language models (LLMs) for identifying and examining intertextual relationships within biblical, Koine Greek texts. By evaluating the performance of LLMs on various intertextuality scenarios the study demonstrates that these models can detect direct quotations, allusions, and echoes between texts. The LLM's ability to generate novel intertextual observations and connections highlights its potential to uncover new insights. However, the model also struggles with long query passages and the inclusion of false intertextual dependences, emphasizing the importance of expert evaluation. The expert-in-the-loop methodology presented offers a scalable approach for intertextual research into the complex web of intertextuality within and beyond the biblical corpus.
☆ The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?
In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance and concerns regarding this, nowadays ubiquitous, technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval augmented generation from music dictionaries. This paper suggests that the potential of LLMs in musicology requires musicology driven research that can specialized LLMs by including accurate and reliable domain knowledge.
☆ AgentRE: An Agent-Based Framework for Navigating Complex Information Landscapes in Relation Extraction CIKM 2024
The relation extraction (RE) in complex scenarios faces challenges such as diverse relation types and ambiguous relations between entities within a single sentence, leading to the poor performance of pure "text-in, text-out" language models (LMs). To address these challenges, in this paper, we propose an agent-based RE framework, namely AgentRE, which fully leverages the potential of large language models (LLMs) including memory, retrieval and reflection, to achieve RE in complex scenarios. Specifically, three major modules are built in AgentRE serving as the tools to help the agent acquire and process various useful information, thereby obtaining improved RE performance. Our extensive experimental results upon two datasets in English and Chinese demonstrate our AgentRE's superior performance, especially in low-resource scenarios. Additionally, the trajectories generated by AgentRE can be refined to construct a high-quality training dataset incorporating different reasoning methods, which can be used to fine-tune smaller models. Code is available at https://github.com/Lightblues/AgentRE.
comment: Accepted by CIKM 2024
☆ Towards Generative Class Prompt Learning for Few-shot Visual Recognition BMVC 2024
Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well on a different domain without fine-tuning. We attribute these to the limitations of the VLM's semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperform existing methods, offering a better alternative to few shot image recognition challenges. The source code will be made available at: https://github.com/soumitri2001/GCPL.
comment: Accepted at BMVC 2024
☆ Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations ALT
As dialogue systems and chatbots increasingly integrate into everyday interactions, the need for efficient and accurate evaluation methods becomes paramount. This study explores the comparative performance of human and AI assessments across a range of dialogue scenarios, focusing on seven key performance indicators (KPIs): Coherence, Innovation, Concreteness, Goal Contribution, Commonsense Contradiction, Incorrect Fact, and Redundancy. Utilizing the GPT-4o API, we generated a diverse dataset of conversations and conducted a two-part experimental analysis. In Experiment 1, we evaluated multi-party conversations on Coherence, Innovation, Concreteness, and Goal Contribution, revealing that GPT models align closely with human judgments. Notably, both human and AI evaluators exhibited a tendency towards binary judgment rather than linear scaling, highlighting a shared challenge in these assessments. Experiment 2 extended the work of Finch et al. (2023) by focusing on dyadic dialogues and assessing Commonsense Contradiction, Incorrect Fact, and Redundancy. The results indicate that while GPT-4o demonstrates strong performance in maintaining factual accuracy and commonsense reasoning, it still struggles with reducing redundancy and self-contradiction. Our findings underscore the potential of GPT models to closely replicate human evaluation in dialogue systems, while also pointing to areas for improvement. This research offers valuable insights for advancing the development and implementation of more refined dialogue evaluation methodologies, contributing to the evolution of more effective and human-like AI communication tools.
comment: 17 pages, 15 figures, shorter version submitted to 22nd Annual Workshop of the Australasian Language Technology Association (ALTA'24)
☆ LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning
Effective planning is essential for the success of any task, from organizing a vacation to routing autonomous vehicles and developing corporate strategies. It involves setting goals, formulating plans, and allocating resources to achieve them. LLMs are particularly well-suited for automated planning due to their strong capabilities in commonsense reasoning. They can deduce a sequence of actions needed to achieve a goal from a given state and identify an effective course of action. However, it is frequently observed that plans generated through direct prompting often fail upon execution. Our survey aims to highlight the existing challenges in planning with language models, focusing on key areas such as embodied environments, optimal scheduling, competitive and cooperative games, task decomposition, reasoning, and planning. Through this study, we explore how LLMs transform AI planning and provide unique insights into the future of LM-assisted planning.
☆ Training on the Benchmark Is Not All You Need
The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model's log probability distribution over the derived data sets. If there is a maximum and outlier in the set of log probabilities, it indicates that the data is leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
☆ LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection
Explainable fake news detection predicts the authenticity of news items with annotated explanations. Today, Large Language Models (LLMs) are known for their powerful natural language understanding and explanation generation abilities. However, presenting LLMs for explainable fake news detection remains two main challenges. Firstly, fake news appears reasonable and could easily mislead LLMs, leaving them unable to understand the complex news-faking process. Secondly, utilizing LLMs for this task would generate both correct and incorrect explanations, which necessitates abundant labor in the loop. In this paper, we propose LLM-GAN, a novel framework that utilizes prompting mechanisms to enable an LLM to become Generator and Detector and for realistic fake news generation and detection. Our results demonstrate LLM-GAN's effectiveness in both prediction performance and explanation quality. We further showcase the integration of LLM-GAN to a cloud-native AI platform to provide better fake news detection service in the cloud.
☆ State-of-the-art Advances of Deep-learning Linguistic Steganalysis Research
With the evolution of generative linguistic steganography techniques, conventional steganalysis falls short in robustly quantifying the alterations induced by steganography, thereby complicating detection. Consequently, the research paradigm has pivoted towards deep-learning-based linguistic steganalysis. This study offers a comprehensive review of existing contributions and evaluates prevailing developmental trajectories. Specifically, we first provided a formalized exposition of the general formulas for linguistic steganalysis, while comparing the differences between this field and the domain of text classification. Subsequently, we classified the existing work into two levels based on vector space mapping and feature extraction models, thereby comparing the research motivations, model advantages, and other details. A comparative analysis of the experiments is conducted to assess the performances. Finally, the challenges faced by this field are discussed, and several directions for future development and key issues that urgently need to be addressed are proposed.
comment: Accepted by 2023 International Conference on Data, Information and Computing Science
☆ FC-KAN: Function Combinations in Kolmogorov-Arnold Networks
In this paper, we introduce FC-KAN, a Kolmogorov-Arnold Network (KAN) that leverages combinations of popular mathematical functions such as B-splines, wavelets, and radial basis functions on low-dimensional data through element-wise operations. We explore several methods for combining the outputs of these functions, including sum, element-wise product, the addition of sum and element-wise product, quadratic function representation, and concatenation. In our experiments, we compare FC-KAN with multi-layer perceptron network (MLP) and other existing KANs, such as BSRBF-KAN, EfficientKAN, FastKAN, and FasterKAN, on the MNIST and Fashion-MNIST datasets. A variant of FC-KAN, which uses a combination of outputs from B-splines and Difference of Gaussians (DoG) in the form of a quadratic function, outperformed all other models on the average of 5 independent training runs. We expect that FC-KAN can leverage function combinations to design future KANs. Our repository is publicly available at: https://github.com/hoangthangta/FC_KAN.
comment: 9 pages, 1 figure
☆ Empirical evidence of Large Language Model's influence on human spoken communication
Artificial Intelligence (AI) agents now interact with billions of humans in natural language, thanks to advances in Large Language Models (LLMs) like ChatGPT. This raises the question of whether AI has the potential to shape a fundamental aspect of human culture: the way we speak. Recent analyses revealed that scientific publications already exhibit evidence of AI-specific language. But this evidence is inconclusive, since scientists may simply be using AI to copy-edit their writing. To explore whether AI has influenced human spoken communication, we transcribed and analyzed about 280,000 English-language videos of presentations, talks, and speeches from more than 20,000 YouTube channels of academic institutions. We find a significant shift in the trend of word usage specific to words distinctively associated with ChatGPT following its release. These findings provide the first empirical evidence that humans increasingly imitate LLMs in their spoken language. Our results raise societal and policy-relevant concerns about the potential of AI to unintentionally reduce linguistic diversity, or to be deliberately misused for mass manipulation. They also highlight the need for further investigation into the feedback loops between machine behavior and human culture.
☆ Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits ECCV 2024
CLIP is a powerful and widely used tool for understanding images in the context of natural language descriptions to perform nuanced tasks. However, it does not offer application-specific fine-grained and structured understanding, due to its generic nature. In this work, we aim to adapt CLIP for fine-grained and structured -- in the form of tabular data -- visual understanding of museum exhibits. To facilitate such understanding we (a) collect, curate, and benchmark a dataset of 200K+ image-table pairs, and (b) develop a method that allows predicting tabular outputs for input images. Our dataset is the first of its kind in the public domain. At the same time, the proposed method is novel in leveraging CLIP's powerful representations for fine-grained and tabular understanding. The proposed method (MUZE) learns to map CLIP's image embeddings to the tabular structure by means of a proposed transformer-based parsing network (parseNet). More specifically, parseNet enables prediction of missing attribute values while integrating context from known attribute-value pairs for an input image. We show that this leads to significant improvement in accuracy. Through exhaustive experiments, we show the effectiveness of the proposed method on fine-grained and structured understanding of museum exhibits, by achieving encouraging results in a newly established benchmark. Our dataset and source-code can be found at: https://github.com/insait-institute/MUZE
comment: Accepted to ECCV 2024
☆ In Defense of RAG in the Era of Long-Context Language Models
Overcoming the limited context limitations in early-generation LLMs, retrieval-augmented generation (RAG) has been a reliable solution for context-based answer generation in the past. Recently, the emergence of long-context LLMs allows the models to incorporate much longer text sequences, making RAG less attractive. Recent studies show that long-context LLMs significantly outperform RAG in long-context applications. Unlike the existing works favoring the long-context LLM over RAG, we argue that the extremely long context in LLMs suffers from a diminished focus on relevant information and leads to potential degradation in answer quality. This paper revisits the RAG in long-context answer generation. We propose an order-preserve retrieval-augmented generation (OP-RAG) mechanism, which significantly improves the performance of RAG for long-context question-answer applications. With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises, and then declines, forming an inverted U-shaped curve. There exist sweet points where OP-RAG could achieve higher answer quality with much less tokens than long-context LLM taking the whole context as input. Extensive experiments on public benchmark demonstrate the superiority of our OP-RAG.
☆ Interpreting and Improving Large Language Models in Arithmetic Calculation ICML 2024
Large language models (LLMs) have demonstrated remarkable potential across numerous applications and have shown an emergent ability to tackle complex reasoning tasks, such as mathematical computations. However, even for the simplest arithmetic calculations, the intrinsic mechanisms behind LLMs remain mysterious, making it challenging to ensure reliability. In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations. Through comprehensive experiments, we find that LLMs frequently involve a small fraction (< 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes. Subsequently, the information from these operands is processed through multi-layer perceptrons (MLPs), progressively leading to the final solution. These pivotal heads/MLPs, though identified on a specific dataset, exhibit transferability across different datasets and even distinct tasks. This insight prompted us to investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance. We empirically find that such precise tuning can yield notable enhancements on mathematical prowess, without compromising the performance on non-mathematical tasks. Our work serves as a preliminary exploration into the arithmetic calculation abilities inherent in LLMs, laying a solid foundation to reveal more intricate mathematical tasks.
comment: Accepted by ICML 2024 (oral)
☆ From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning ICML 2024
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, while it typically leads to the degeneration of LLMs' general capability. To address the challenge, we propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage (<5%) of the basic modules, which significantly affect a particular behavior of LLMs. i.e., sycophancy. Subsequently, SPT merely fine-tunes these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments, demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs.
comment: Accepted by ICML 2024
☆ CTG-KrEW: Generating Synthetic Structured Contextually Correlated Content by Conditional Tabular GAN with K-Means Clustering and Efficient Word Embedding
Conditional Tabular Generative Adversarial Networks (CTGAN) and their various derivatives are attractive for their ability to efficiently and flexibly create synthetic tabular data, showcasing strong performance and adaptability. However, there are certain critical limitations to such models. The first is their inability to preserve the semantic integrity of contextually correlated words or phrases. For instance, skillset in freelancer profiles is one such attribute where individual skills are semantically interconnected and indicative of specific domain interests or qualifications. The second challenge of traditional approaches is that, when applied to generate contextually correlated tabular content, besides generating semantically shallow content, they consume huge memory resources and CPU time during the training stage. To address these problems, we introduce a novel framework, CTGKrEW (Conditional Tabular GAN with KMeans Clustering and Word Embedding), which is adept at generating realistic synthetic tabular data where attributes are collections of semantically and contextually coherent words. CTGKrEW is trained and evaluated using a dataset from Upwork, a realworld freelancing platform. Comprehensive experiments were conducted to analyze the variability, contextual similarity, frequency distribution, and associativity of the generated data, along with testing the framework's system feasibility. CTGKrEW also takes around 99\% less CPU time and 33\% less memory footprints than the conventional approach. Furthermore, we developed KrEW, a web application to facilitate the generation of realistic data containing skill-related information. This application, available at https://riyasamanta.github.io/krew.html, is freely accessible to both the general public and the research community.
☆ Booster: Tackling Harmful Fine-tuing for Large Language Models via Attenuating Harmful Perturbation
Harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for Large language models' fine-tuning-as-a-service. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performances are still far away from satisfactory, and the root cause of the problem has not been fully recovered. For the first time in the literature, we in this paper show that \textit{harmful perturbation} over the model weights should be the root cause of alignment-broken of harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at \url{https://github.com/git-disl/Booster}.
☆ Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow. However, pre-training of Vision Encoder and the integrated training of LLMs with Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can completely handle their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks that create datasets using machine translation have cultural differences and biases, remaining issues for use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset that takes into account nuances and country-specific phrases was then used to evaluate the generation explanation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English compared to English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data.
☆ AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Muiti-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
comment: 8 pages, 5 figures, code available at https://anonymous.4open.science/r/AdaComp-8C0C/
☆ An Implementation of Werewolf Agent That does not Truly Trust LLMs
Werewolf is an incomplete information game, which has several challenges when creating a computer agent as a player given the lack of understanding of the situation and individuality of utterance (e.g., computer agents are not capable of characterful utterance or situational lying). We propose a werewolf agent that solves some of those difficulties by combining a Large Language Model (LLM) and a rule-based algorithm. In particular, our agent uses a rule-based algorithm to select an output either from an LLM or a template prepared beforehand based on the results of analyzing conversation history using an LLM. It allows the agent to refute in specific situations, identify when to end the conversation, and behave with persona. This approach mitigated conversational inconsistencies and facilitated logical utterance as a result. We also conducted a qualitative evaluation, which resulted in our agent being perceived as more human-like compared to an unmodified LLM. The agent is freely available for contributing to advance the research in the field of Werewolf game.
☆ Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture
This study introduces a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge, with a specific focus on Hakka culture as a case study. Leveraging Bloom's Taxonomy, the study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. This benchmark extends beyond traditional single-dimensional evaluations by providing a deeper analysis of LLMs' abilities to handle culturally specific content, ranging from basic recall of facts to higher-order cognitive tasks such as creative synthesis. Additionally, the study integrates Retrieval-Augmented Generation (RAG) technology to address the challenges of minority cultural knowledge representation in LLMs, demonstrating how RAG enhances the models' performance by dynamically incorporating relevant external information. The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal the limitations of RAG in creative tasks, underscoring the need for further optimization. This benchmark provides a robust tool for evaluating and comparing LLMs in culturally diverse contexts, offering valuable insights for future research and development in AI-driven cultural knowledge preservation and dissemination.
comment: Submitted to O-COCOSDA 2024
☆ Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
Large language models (LLMs) have shown success in generating high-quality responses. In order to achieve better alignment with LLMs with human preference, various works are proposed based on specific optimization process, which, however, is not suitable to Black-Box LLMs like GPT-4, due to inaccessible parameters. In Black-Box LLMs case, their performance is highly dependent on the quality of the provided prompts. Existing methods to enhance response quality often involve a prompt refinement model, yet these approaches potentially suffer from semantic inconsistencies between the refined and original prompts, and typically overlook the relationship between them. To address these challenges, we introduce a self-instructed in-context learning framework that empowers LLMs to deliver more effective responses by generating reliable derived prompts to construct informative contextual environments. Our approach incorporates a self-instructed reinforcement learning mechanism, enabling direct interaction with the response model during derived prompt generation for better alignment. We then formulate querying as an in-context learning task, using responses from LLMs combined with the derived prompts to establish a contextual demonstration for the original prompt. This strategy ensures alignment with the original query, reduces discrepancies from refined prompts, and maximizes the LLMs' in-context learning capability. Extensive experiments demonstrate that the proposed method not only generates more reliable derived prompts but also significantly enhances LLMs' ability to deliver more effective responses, including Black-Box models such as GPT-4.
☆ VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka
This paper introduces VoxHakka, a text-to-speech (TTS) system designed for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Leveraging the YourTTS framework, VoxHakka achieves high naturalness and accuracy and low real-time factor in speech synthesis while supporting six distinct Hakka dialects. This is achieved by training the model with dialect-specific data, allowing for the generation of speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we employed a cost-effective approach utilizing a web scraping pipeline coupled with automatic speech recognition (ASR)-based data cleaning techniques. This process ensured the acquisition of a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. Subjective listening tests conducted using comparative mean opinion scores (CMOS) demonstrate that VoxHakka significantly outperforms existing publicly available Hakka TTS systems in terms of pronunciation accuracy, tone correctness, and overall naturalness. This work represents a significant advancement in Hakka language technology and provides a valuable resource for language preservation and revitalization efforts.
comment: Submitted to O-COCOSDA 2024
☆ Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation.
comment: Accepted to IEEE SLT 2024
☆ It is Time to Develop an Auditing Framework to Promote Value Aware Chatbots
The launch of ChatGPT in November 2022 marked the beginning of a new era in AI, the availability of generative AI tools for everyone to use. ChatGPT and other similar chatbots boast a wide range of capabilities from answering student homework questions to creating music and art. Given the large amounts of human data chatbots are built on, it is inevitable that they will inherit human errors and biases. These biases have the potential to inflict significant harm or increase inequity on different subpopulations. Because chatbots do not have an inherent understanding of societal values, they may create new content that is contrary to established norms. Examples of concerning generated content includes child pornography, inaccurate facts, and discriminatory posts. In this position paper, we argue that the speed of advancement of this technology requires us, as computer and data scientists, to mobilize and develop a values-based auditing framework containing a community established standard set of measurements to monitor the health of different chatbots and LLMs. To support our argument, we use a simple audit template to share the results of basic audits we conduct that are focused on measuring potential bias in search engine style tasks, code generation, and story generation. We identify responses from GPT 3.5 and GPT 4 that are both consistent and not consistent with values derived from existing law. While the findings come as no surprise, they do underscore the urgency of developing a robust auditing framework for openly sharing results in a consistent way so that mitigation strategies can be developed by the academic community, government agencies, and companies when our values are not being adhered to. We conclude this paper with recommendations for value-based strategies for improving the technologies.
comment: arXiv admin note: substantial text overlap with arXiv:2306.07500
☆ S$^3$c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners
Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S$^3$c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We proposed a method, which employs a step-level sampling approach to construct step-wise self-correction data for achieving such ability. Additionally, we implement a training strategy that uses above constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
♻ ☆ Improving Rare Word Translation With Dictionaries and Attention Masking
In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.
comment: 11 pages, 3 figures, 3 tables. Accepted at AMTA 2024
♻ ☆ Low-Rank Quantization-Aware Training for LLMs
Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods, generally produce the best quantized performance, however it comes at the cost of potentially long training time and excessive memory usage, making it impractical when applying for LLMs. Inspired by parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices quantization granularity, activation quantization, and seamlessly combined with many PTQ techniques. We apply LR-QAT to LLaMA-1/2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at the fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer grade GPU with 24GB of memory. Our source code is available at https://github.com/qualcomm-ai-research/LR-QAT
♻ ☆ Foundation Models for Music: A Survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that following research on FM for music should focus more on such issues as interpretability, transparency, human responsibility, and copyright issues. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.
♻ ☆ InkubaLM: A small language model for low-resource African languages
High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation, question-answering, AfriMMLU, and the AfriXnli task. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available at https://huggingface.co/lelapa to encourage research and development on low-resource languages.
♻ ☆ OceanGPT: A Large Language Model for Ocean Science Tasks ACL2024
Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet's surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reasons are the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever large language model in the ocean domain, which is expert in various ocean science tasks. We also propose OceanGPT, a novel framework to automatically obtain a large volume of ocean domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Though comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for oceans science tasks but also gains preliminary embodied intelligence capabilities in ocean technology.
comment: ACL2024. Project Website: http://oceangpt.zjukg.cn/
♻ ☆ A Survey on Stability of Learning with Limited Labelled Data and its Sensitivity to the Effects of Randomness
Learning with limited labelled data, such as prompting, in-context learning, fine-tuning, meta-learning or few-shot learning, aims to effectively train a model using only a small amount of labelled samples. However, these approaches have been observed to be excessively sensitive to the effects of uncontrolled randomness caused by non-determinism in the training process. The randomness negatively affects the stability of the models, leading to large variances in results across training runs. When such sensitivity is disregarded, it can unintentionally, but unfortunately also intentionally, create an imaginary perception of research progress. Recently, this area started to attract research attention and the number of relevant studies is continuously growing. In this survey, we provide a comprehensive overview of 415 papers addressing the effects of randomness on the stability of learning with limited labelled data. We distinguish between four main tasks addressed in the papers (investigate/evaluate; determine; mitigate; benchmark/compare/report randomness effects), providing findings for each one. Furthermore, we identify and discuss seven challenges and open problems together with possible directions to facilitate further research. The ultimate goal of this survey is to emphasise the importance of this growing research area, which so far has not received an appropriate level of attention, and reveal impactful directions for future research.
comment: Accepted to ACM Comput. Surv. 2024
♻ ☆ Towards Scalable Automated Alignment of LLMs: A Survey
Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approaches. In this paper, we systematically review the recently emerging methods of automated alignment, attempting to explore how to achieve effective, scalable, automated alignment once the capabilities of LLMs exceed those of humans. Specifically, we categorize existing automated alignment methods into 4 major categories based on the sources of alignment signals and discuss the current status and potential development of each category. Additionally, we explore the underlying mechanisms that enable automated alignment and discuss the essential factors that make automated alignment technologies feasible and effective from the fundamental role of alignment.
comment: Paper List: https://github.com/cascip/awesome-auto-alignment
♻ ☆ White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
comment: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes and formatting improvements. V3: improvements from journal revisions
♻ ☆ A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors
The relationship between the quality of a string, as judged by a human reader, and its probability, $p(\boldsymbol{y})$ under a language model undergirds the development of better language models. For example, many popular algorithms for sampling from a language model have been conceived with the goal of manipulating $p(\boldsymbol{y})$ to place higher probability on strings that humans deem of high quality. In this article, we examine the probability--quality relationship in language models explicitly aligned to human preferences, e.g., through reinforcement learning through human feedback. We show that, when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood under the prior language model, i.e., the same model before alignment with human preferences. We provide a formal treatment of this phenomenon and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.
♻ ☆ Correcting misinformation on social media with a large language model
Real-world misinformation, often multimodal, can be partially or fully factual but misleading using diverse tactics like conflating correlation with causation. Such misinformation is severely understudied, challenging to address, and harms various social domains, particularly on social media, where it can spread rapidly. High-quality and timely correction of misinformation that identifies and explains its (in)accuracies effectively reduces false beliefs. Despite the wide acceptance of manual correction, it is difficult to be timely and scalable. While LLMs have versatile capabilities that could accelerate misinformation correction, they struggle due to a lack of recent information, a tendency to produce false content, and limitations in addressing multimodal information. We propose MUSE, an LLM augmented with access to and credibility evaluation of up-to-date information. By retrieving evidence as refutations or supporting context, MUSE identifies and explains content (in)accuracies with references. It conducts multimodal retrieval and interprets visual content to verify and correct multimodal content. Given the absence of a comprehensive evaluation approach, we propose 13 dimensions of misinformation correction quality. Then, fact-checking experts evaluate responses to social media content that are not presupposed to be misinformation but broadly include (partially) incorrect and correct posts that may (not) be misleading. Results demonstrate MUSE's ability to write high-quality responses to potential misinformation--across modalities, tactics, domains, political leanings, and for information that has not previously been fact-checked online--within minutes of its appearance on social media. Overall, MUSE outperforms GPT-4 by 37% and even high-quality responses from laypeople by 29%. Our work provides a general methodological and evaluative framework to correct misinformation at scale.
comment: 50 pages
♻ ☆ NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to a thousand GPUs for training the largest open-source LLMs such as Nemotron 4 340B and Llama 3.1 405B. NeMo-Aligner comes with highly optimized and scalable implementations for major paradigms of model alignment such as: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally, our toolkit supports running most of the alignment techniques in a Parameter Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for extensibility, allowing support for other alignment techniques with minimal effort. It is open-sourced with Apache 2.0 License and we invite community contributions at https://github.com/NVIDIA/NeMo-Aligner
comment: 16 pages, 4 figures, Accepted to COLM 2024
♻ ☆ Squid: Long Context as a New Modality for Energy-Efficient On-Device Language Models
This paper presents Dolphin, a novel decoder-decoder architecture for energy-efficient processing of long contexts in language models. Our approach addresses the significant energy consumption and latency challenges inherent in on-device models. Dolphin employs a compact 0.5B parameter decoder to distill extensive contextual information into a memory embedding, substantially reducing the input length for the primary 7B parameter decoder model. Inspired by vision-language models, we repurpose the image embedding projector to encode long textual contexts, effectively treating extended context as a distinct modality. This innovative method enables processing of substantially longer contexts without the typical computational overhead associated with extended input sequences. Empirical evaluations demonstrate a 10-fold improvement in energy efficiency and a 5-fold reduction in latency compared to conventional full-length context processing methods without losing quality of the response. Our work contributes to the development of more sustainable and scalable language models for on-device applications, addressing the critical need for energy-efficient and responsive AI technologies in resource-constrained environments while maintaining the accuracy to understand long contexts. This research has implications for the broader field of natural language processing, particularly in the domain of efficient model design for resource-limited settings. By enabling more sophisticated AI capabilities on edge devices, Dolphin paves the way for advanced language processing in a wide range of applications where computational resources are at a premium. The Dolphin model is publicly available at https://huggingface.co/NexaAIDev/Dolphin.
♻ ☆ OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. Language model systems often enable LLMs to generate code for arithmetic operations to achieve accurate calculations. However, this approach compromises speed and security, and fine-tuning risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in a single autoregressive step, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of a LLM to control a symbolic architecture that performs arithmetic. Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100\% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o with and without a code interpreter. Furthermore, OccamLlama outperforms GPT 4o with and without a code interpreter on average across a range of mathematical problem solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models. We will make our code public shortly.
♻ ☆ Flood of Techniques and Drought of Theories: Emotion Mining in Disasters
Emotion mining has become a crucial tool for understanding human emotions during disasters, leveraging the extensive data generated on social media platforms. This paper aims to summarize existing research on emotion mining within disaster contexts, highlighting both significant discoveries and persistent issues. On the one hand, emotion mining techniques have achieved acceptable accuracy enabling applications such as rapid damage assessment and mental health surveillance. On the other hand, with many studies adopting data-driven approaches, several methodological issues remain. These include arbitrary emotion classification, ignoring biases inherent in data collection from social media, such as the overrepresentation of individuals from higher socioeconomic status on Twitter, and the lack of application of theoretical frameworks like cross-cultural comparisons. These problems can be summarized as a notable lack of theory-driven research and ignoring insights from social and behavioral sciences. This paper underscores the need for interdisciplinary collaboration between computer scientists and social scientists to develop more robust and theoretically grounded approaches in emotion mining. By addressing these gaps, we aim to enhance the effectiveness and reliability of emotion mining methodologies, ultimately contributing to improved disaster preparedness, response, and recovery. Keywords: emotion mining, sentiment analysis, natural disasters, psychology, technological disasters
♻ ☆ How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates decision-making capabilities of LLMs through the lens of Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We introduce GAMA($\gamma$)-Bench, which evaluates LLMs' Gaming Ability in Multi-Agent environments. $\gamma$-Bench includes eight classical multi-agent games and a scoring scheme specially designed to quantitatively assess LLMs' performance. Leveraging $\gamma$-Bench, we investigate LLMs' robustness, generalizability, and strategies for enhancement. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we evaluate twelve versions from six models, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. We find that Gemini-1.5-Pro outperforms other models with a score of $63.8$ out of $100$, followed by LLaMA-3.1-70B and GPT-4 with scores of $60.9$ and $60.5$, respectively. The code and experimental results are made publicly available via https://github.com/CUHK-ARISE/GAMABench.
comment: 11 pages of main text. 20 pages of appendices. 12 figures, 9 tables. Added models: Gemini-1.5-Pro, LLaMA-3.1-{7, 70, 405}B, Mixtral-8x{7, 22}B, Qwen-2-72B
♻ ☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
♻ ☆ COFFEE: A Contrastive Oracle-Free Framework for Event Extraction ATC
Event extraction is a complex information extraction task that involves extracting events from unstructured text. Prior classification-based methods require comprehensive entity annotations for joint training, while newer generation-based methods rely on heuristic templates containing oracle information such as event type, which is often unavailable in real-world scenarios. In this study, we consider a more realistic setting of this task, namely the Oracle-Free Event Extraction (OFEE) task, where only the input context is given without any oracle information, including event type, event ontology and trigger word. To solve this task, we propose a new framework, called COFFEE, which extracts the events solely based on the document context without referring to any oracle information. In particular, a contrastive selection model is introduced in COFFEE to rectify the generated triggers and handle multi-event instances. The proposed COFFEE outperforms state-of-the-art approaches under the oracle-free setting of the event extraction task, as evaluated on a public event extraction benchmark ACE05.
comment: Accepted to MATCHING Workshop at ACL 2023
♻ ☆ Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification
The lack of a suitable tool for the analysis of conversational texts in the Persian language has made various analyses of these texts, including Sentiment Analysis, difficult. In this research, we tried to make the understanding of these texts easier for the machine by providing PSC, Persian Slang Converter, a tool for converting conversational texts into formal ones, and by using the most up-to-date and best deep learning methods along with the PSC, the sentiment learning of short Persian language texts for the machine in a better way. be made More than 10 million unlabeled texts from various social networks and movie subtitles (as Conversational texts) and about 10 million news texts (as formal texts) have been used for training unsupervised models and formal implementation of the tool. 60,000 texts from the comments of Instagram social network users with positive, negative, and neutral labels are considered supervised data for training the emotion classification model of short texts. Using the formal tool, 57% of the words of the corpus of conversation were converted. Finally, by using the formalizer, FastText model, and deep LSTM network, an accuracy of 81.91 was obtained on the test data.
comment: 16 pages, 4 figures, 14 tables
♻ ☆ Preference Learning Algorithms Do Not Learn Preference Rankings
Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via $\textit{ranking accuracy}$. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the $\textit{idealized ranking accuracy}$ that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant $\textit{alignment gap}$ -- $\textit{i.e.}$, a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
♻ ☆ A Voter-Based Stochastic Rejection-Method Framework for Asymptotically Safe Language Model Outputs
This paper proposes a new method for preventing unsafe or otherwise low quality large language model (LLM) outputs, by leveraging the stochasticity of LLMs. We propose a system whereby LLM checkers vote on the acceptability of a generated output, regenerating it if a threshold of disapproval is reached, until sufficient checkers approve. We further propose estimators for cost and failure rate, and based on those estimators and experimental data tailored to the application, we propose an algorithm that achieves a desired failure rate at the least possible cost. We demonstrate that, under these models, failure rate decreases exponentially as a function of cost when voter count and threshold are chosen according to the algorithm, and that the models reasonably estimate the actual performance of such a system in action, even with limited data.
comment: 7 pages, 2 figures
♻ ☆ ARN: Analogical Reasoning on Narratives
As a core cognitive skill that enables the transferability of information across domains, analogical reasoning has been extensively studied for both humans and computational models. However, while cognitive theories of analogy often focus on narratives and study the distinction between surface, relational, and system similarities, existing work in natural language processing has a narrower focus as far as relational analogies between word pairs. This gap brings a natural question: can state-of-the-art large language models (LLMs) detect system analogies between narratives? To gain insight into this question and extend word-based relational analogies to relational system analogies, we devise a comprehensive computational framework that operationalizes dominant theories of analogy, using narrative elements to create surface and system mappings. Leveraging the interplay between these mappings, we create a binary task and benchmark for Analogical Reasoning on Narratives (ARN), covering four categories of far (cross-domain)/near (within-domain) analogies and disanalogies. We show that while all LLMs can largely recognize near analogies, even the largest ones struggle with far analogies in a zero-shot setting, with GPT4.0 scoring below random. Guiding the models through solved examples and chain-of-thought reasoning enhances their analogical reasoning ability. Yet, since even in the few-shot setting, the best model only performs halfway between random and humans, ARN opens exciting directions for computational analogical reasoners.
♻ ☆ Zyda: A 1.3T Dataset for Open Language Modeling
The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda's effectiveness, outperforming even the best of its constituent datasets when used independently.
♻ ☆ Correction with Backtracking Reduces Hallucination in Summarization
Abstractive summarization aims at generating natural language summaries of a source document that are succinct while preserving the important elements. Despite recent advances, neural text summarization models are known to be susceptible to hallucinating (or more correctly confabulating), that is to produce summaries with details that are not grounded in the source document. In this paper, we introduce a simple yet efficient technique, CoBa, to reduce hallucination in abstractive summarization. The approach is based on two steps: hallucination detection and mitigation. We show that the former can be achieved through measuring simple statistics about conditional word probabilities and distance to context words. Further, we demonstrate that straight-forward backtracking is surprisingly effective at mitigation. We thoroughly evaluate the proposed method with prior art on three benchmark datasets for text summarization. The results show that CoBa is effective and efficient in reducing hallucination, and offers great adaptability and flexibility. Code can be found at https://github.com/zhenzhel/CoBa.
♻ ☆ Investigating the Robustness of LLMs on Math Word Problems
Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, ProbleMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and better ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to ~6%.
♻ ☆ A Survey on Responsible Generative AI: What to Generate and What Not
In recent years, generative AI (GenAI), like large language models and text-to-image models, has received significant attention across various domains. However, ensuring the responsible generation of content by these models is crucial for their real-world applicability. This raises an interesting question: What should responsible GenAI generate, and what should it not? To answer the question, this paper investigates the practical responsible requirements of both textual and visual generative models, outlining five key considerations: generating truthful content, avoiding toxic content, refusing harmful instruction, leaking no training data-related content, and ensuring generated content identifiable. Specifically, we review recent advancements and challenges in addressing these requirements. Besides, we discuss and emphasize the importance of responsible GenAI across healthcare, education, finance, and artificial general intelligence domains. Through a unified perspective on both textual and visual generative models, this paper aims to provide insights into practical safety-related issues and further benefit the community in building responsible GenAI.
comment: 77 pages, 10 figures
♻ ☆ GPT has become financially literate: Insights from financial literacy tests of GPT and a preliminary test of how people use it as a source of advice
We assess the ability of GPT -- a large language model -- to serve as a financial robo-advisor for the masses, by using a financial literacy test. Davinci and ChatGPT based on GPT-3.5 score 66% and 65% on the financial literacy test, respectively, compared to a baseline of 33%. However, ChatGPT based on GPT-4 achieves a near-perfect 99% score, pointing to financial literacy becoming an emergent ability of state-of-the-art models. We use the Judge-Advisor System and a savings dilemma to illustrate how researchers might assess advice-utilization from large language models. We also present a number of directions for future research.
comment: 43 pages, 2 figures and 2 tables in main text; in V2 added information that this is the Author Accepted Manuscript version
♻ ☆ Interpretation of Intracardiac Electrograms Through Textual Representations
Understanding the irregular electrical activity of atrial fibrillation (AFib) has been a key challenge in electrocardiography. For serious cases of AFib, catheter ablations are performed to collect intracardiac electrograms (EGMs). EGMs offer intricately detailed and localized electrical activity of the heart and are an ideal modality for interpretable cardiac studies. Recent advancements in artificial intelligence (AI) has allowed some works to utilize deep learning frameworks to interpret EGMs during AFib. Additionally, language models (LMs) have shown exceptional performance in being able to generalize to unseen domains, especially in healthcare. In this study, we are the first to leverage pretrained LMs for finetuning of EGM interpolation and AFib classification via masked language modeling. We formulate the EGM as a textual sequence and present competitive performances on AFib classification compared against other representations. Lastly, we provide a comprehensive interpretability study to provide a multi-perspective intuition of the model's behavior, which could greatly benefit the clinical use.
comment: 17 pages, 7 figures; Accepted to CHIL 2024
♻ ☆ RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback ICML 2024
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
comment: Presented at ICML 2024
Computer Vision and Pattern Recognition
☆ Coaching a Robotic Sonographer: Learning Robotic Ultrasound with Sparse Expert's Feedback
Ultrasound is widely employed for clinical intervention and diagnosis, due to its advantages of offering non-invasive, radiation-free, and real-time imaging. However, the accessibility of this dexterous procedure is limited due to the substantial training and expertise required of operators. The robotic ultrasound (RUS) offers a viable solution to address this limitation; nonetheless, achieving human-level proficiency remains challenging. Learning from demonstrations (LfD) methods have been explored in RUS, which learns the policy prior from a dataset of offline demonstrations to encode the mental model of the expert sonographer. However, active engagement of experts, i.e. Coaching, during the training of RUS has not been explored thus far. Coaching is known for enhancing efficiency and performance in human training. This paper proposes a coaching framework for RUS to amplify its performance. The framework combines DRL (self-supervised practice) with sparse expert's feedback through coaching. The DRL employs an off-policy Soft Actor-Critic (SAC) network, with a reward based on image quality rating. The coaching by experts is modeled as a Partially Observable Markov Decision Process (POMDP), which updates the policy parameters based on the correction by the expert. The validation study on phantoms showed that coaching increases the learning rate by $25\%$ and the number of high-quality image acquisition by $74.5\%$.
comment: Accepted in IEEE Transactions on Medical Robotics and Bionics (TMRB) 2024
☆ What Do You See in Common? Learning Hierarchical Prototypes over Tree-of-Life to Discover Evolutionary Traits
A grand challenge in biology is to discover evolutionary traits - features of organisms common to a group of species with a shared ancestor in the tree of life (also referred to as phylogenetic tree). With the growing availability of image repositories in biology, there is a tremendous opportunity to discover evolutionary traits directly from images in the form of a hierarchy of prototypes. However, current prototype-based methods are mostly designed to operate over a flat structure of classes and face several challenges in discovering hierarchical prototypes, including the issue of learning over-specific features at internal nodes. To overcome these challenges, we introduce the framework of Hierarchy aligned Commonality through Prototypical Networks (HComP-Net). We empirically show that HComP-Net learns prototypes that are accurate, semantically consistent, and generalizable to unseen species in comparison to baselines on birds, butterflies, and fishes datasets. The code and datasets are available at https://github.com/Imageomics/HComPNet.
comment: 34 pages, 27 figures
☆ YoloTag: Vision-based Robust UAV Navigation with Fiducial Markers
By harnessing fiducial markers as visual landmarks in the environment, Unmanned Aerial Vehicles (UAVs) can rapidly build precise maps and navigate spaces safely and efficiently, unlocking their potential for fluent collaboration and coexistence with humans. Existing fiducial marker methods rely on handcrafted feature extraction, which sacrifices accuracy. On the other hand, deep learning pipelines for marker detection fail to meet real-time runtime constraints crucial for navigation applications. In this work, we propose YoloTag \textemdash a real-time fiducial marker-based localization system. YoloTag uses a lightweight YOLO v8 object detector to accurately detect fiducial markers in images while meeting the runtime constraints needed for navigation. The detected markers are then used by an efficient perspective-n-point algorithm to estimate UAV states. However, this localization system introduces noise, causing instability in trajectory tracking. To suppress noise, we design a higher-order Butterworth filter that effectively eliminates noise through frequency domain analysis. We evaluate our algorithm through real-robot experiments in an indoor environment, comparing the trajectory tracking performance of our method against other approaches in terms of several distance metrics.
☆ Visual Servoing for Robotic On-Orbit Servicing: A Survey
On-orbit servicing (OOS) activities will power the next big step for sustainable exploration and commercialization of space. Developing robotic capabilities for autonomous OOS operations is a priority for the space industry. Visual Servoing (VS) enables robots to achieve the precise manoeuvres needed for critical OOS missions by utilizing visual information for motion control. This article presents an overview of existing VS approaches for autonomous OOS operations with space manipulator systems (SMS). We divide the approaches according to their contribution to the typical phases of a robotic OOS mission: a) Recognition, b) Approach, and c) Contact. We also present a discussion on the reviewed VS approaches, identifying current trends. Finally, we highlight the challenges and areas for future research on VS techniques for robotic OOS.
comment: Accepted for publication at the 2024 International Conference on Space Robotics (iSpaRo)
☆ Geometry-aware Feature Matching for Large-Scale Structure from Motion
Establishing consistent and dense correspondences across multiple images is crucial for Structure from Motion (SfM) systems. Significant view changes, such as air-to-ground with very sparse view overlap, pose an even greater challenge to the correspondence solvers. We present a novel optimization-based approach that significantly enhances existing feature matching methods by introducing geometry cues in addition to color cues. This helps fill gaps when there is less overlap in large-scale scenarios. Our method formulates geometric verification as an optimization problem, guiding feature matching within detector-free methods and using sparse correspondences from detector-based methods as anchor points. By enforcing geometric constraints via the Sampson Distance, our approach ensures that the denser correspondences from detector-free methods are geometrically consistent and more accurate. This hybrid strategy significantly improves correspondence density and accuracy, mitigates multi-view inconsistencies, and leads to notable advancements in camera pose accuracy and point cloud density. It outperforms state-of-the-art feature matching methods on benchmark datasets and enables feature matching in challenging extreme large-scale settings.
☆ QID$^2$: An Image-Conditioned Diffusion Model for Q-space Up-sampling of DWI Data MICCAI 2024
We propose an image-conditioned diffusion model to estimate high angular resolution diffusion weighted imaging (DWI) from a low angular resolution acquisition. Our model, which we call QID$^2$, takes as input a set of low angular resolution DWI data and uses this information to estimate the DWI data associated with a target gradient direction. We leverage a U-Net architecture with cross-attention to preserve the positional information of the reference images, further guiding the target image generation. We train and evaluate QID$^2$ on single-shell DWI samples curated from the Human Connectome Project (HCP) dataset. Specifically, we sub-sample the HCP gradient directions to produce low angular resolution DWI data and train QID$^2$ to reconstruct the missing high angular resolution samples. We compare QID$^2$ with two state-of-the-art GAN models. Our results demonstrate that QID$^2$ not only achieves higher-quality generated images, but it consistently outperforms the GAN models in downstream tensor estimation across multiple metrics. Taken together, this study highlights the potential of diffusion models, and QID$^2$ in particular, for q-space up-sampling, thus offering a promising toolkit for clinical and research applications.
comment: Accepted at MICCAI 2024 International Workshop on Computational Diffusion MRI. Zijian Chen and Jueqi Wang contributed equally to this work
☆ Unsupervised Welding Defect Detection Using Audio And Video
In this work we explore the application of AI to robotic welding. Robotic welding is a widely used technology in many industries, but robots currently do not have the capability to detect welding defects which get introduced due to various reasons in the welding process. We describe how deep-learning methods can be applied to detect weld defects in real-time by recording the welding process with microphones and a camera. Our findings are based on a large database with more than 4000 welding samples we collected which covers different weld types, materials and various defect categories. All deep learning models are trained in an unsupervised fashion because the space of possible defects is large and the defects in our data may contain biases. We demonstrate that a reliable real-time detection of most categories of weld defects is feasible both from audio and video, with improvements achieved by combining both modalities. Specifically, the multi-modal approach achieves an average Area-under-ROC-Curve (AUC) of 0.92 over all eleven defect types in our data. We conclude the paper with an analysis of the results by defect type and a discussion of future work.
comment: 21 pages
☆ Biochemical Prostate Cancer Recurrence Prediction: Thinking Fast & Slow
Time to biochemical recurrence in prostate cancer is essential for prognostic monitoring of the progression of patients after prostatectomy, which assesses the efficacy of the surgery. In this work, we proposed to leverage multiple instance learning through a two-stage ``thinking fast \& slow'' strategy for the time to recurrence (TTR) prediction. The first (``thinking fast'') stage finds the most relevant WSI area for biochemical recurrence and the second (``thinking slow'') stage leverages higher resolution patches to predict TTR. Our approach reveals a mean C-index ($Ci$) of 0.733 ($\theta=0.059$) on our internal validation and $Ci=0.603$ on the LEOPARD challenge validation set. Post hoc attention visualization shows that the most attentive area contributes to the TTR prediction.
comment: 8 pages, 3 figures, methodology paper for LEOPRARD Challenge
☆ K-Origins: Better Colour Quantification for Neural Networks
K-Origins is a neural network layer designed to improve image-based network performances when learning colour, or intensities, is beneficial. Over 250 encoder-decoder convolutional networks are trained and tested on 16-bit synthetic data, demonstrating that K-Origins improves semantic segmentation accuracy in two scenarios: object detection with low signal-to-noise ratios, and segmenting multiple objects that are identical in shape but vary in colour. K-Origins generates output features from the input features, $\textbf{X}$, by the equation $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$ for each trainable parameter $w_k$, where $\textbf{J}$ is a matrix of ones. Additionally, networks with varying receptive fields were trained to determine optimal network depths based on the dimensions of target classes, suggesting that receptive field lengths should exceed object sizes. By ensuring a sufficient receptive field length and incorporating K-Origins, we can achieve better semantic network performance.
comment: 16 pages, 13 figures, 1 table
☆ Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems
Recent developments in vision language models (VLM) have shown great potential for diverse applications related to image understanding. In this study, we have explored state-of-the-art VLM models for vision-based transportation engineering tasks such as image classification and object detection. The image classification task involves congestion detection and crack identification, whereas, for object detection, helmet violations were identified. We have applied open-source models such as CLIP, BLIP, OWL-ViT, Llava-Next, and closed-source GPT-4o to evaluate the performance of these state-of-the-art VLM models to harness the capabilities of language understanding for vision-based transportation tasks. These tasks were performed by applying zero-shot prompting to the VLM models, as zero-shot prompting involves performing tasks without any training on those tasks. It eliminates the need for annotated datasets or fine-tuning for specific tasks. Though these models gave comparative results with benchmark Convolutional Neural Networks (CNN) models in the image classification tasks, for object localization tasks, it still needs improvement. Therefore, this study provides a comprehensive evaluation of the state-of-the-art VLM models highlighting the advantages and limitations of the models, which can be taken as the baseline for future improvement and wide-scale implementation.
☆ ADHD diagnosis based on action characteristics recorded in videos using machine learning
Demand for ADHD diagnosis and treatment is increasing significantly and the existing services are unable to meet the demand in a timely manner. In this work, we introduce a novel action recognition method for ADHD diagnosis by identifying and analysing raw video recordings. Our main contributions include 1) designing and implementing a test focusing on the attention and hyperactivity/impulsivity of participants, recorded through three cameras; 2) implementing a novel machine learning ADHD diagnosis system based on action recognition neural networks for the first time; 3) proposing classification criteria to provide diagnosis results and analysis of ADHD action characteristics.
comment: Neuroscience Applied
☆ Action-Based ADHD Diagnosis in Video
Attention Deficit Hyperactivity Disorder (ADHD) causes significant impairment in various domains. Early diagnosis of ADHD and treatment could significantly improve the quality of life and functioning. Recently, machine learning methods have improved the accuracy and efficiency of the ADHD diagnosis process. However, the cost of the equipment and trained staff required by the existing methods are generally huge. Therefore, we introduce the video-based frame-level action recognition network to ADHD diagnosis for the first time. We also record a real multi-modal ADHD dataset and extract three action classes from the video modality for ADHD diagnosis. The whole process data have been reported to CNTW-NHS Foundation Trust, which would be reviewed by medical consultants/professionals and will be made public in due course.
comment: 31st European Symposium on Artificial Neural Networks
☆ Optimal L-Systems for Stochastic L-system Inference Problems
This paper presents two novel theorems that address two open problems in stochastic Lindenmayer-system (L-system) inference, specifically focusing on the construction of an optimal stochastic L-system capable of generating a given sequence of strings. The first theorem delineates a method for crafting a stochastic L-system that maximizes the likelihood of producing a given sequence of words through a singular derivation. Furthermore, the second theorem determines the stochastic L-systems with the highest probability of producing a given sequence of words with multiple possible derivations. From these, we introduce an algorithm to infer an optimal stochastic L-system from a given sequence. This algorithm incorporates sophisticated optimization techniques, such as interior point methods, ensuring production of a stochastically optimal stochastic L-system suitable for generating the given sequence. This allows for the use of using stochastic L-systems as model for machine learning using only positive data for training.
☆ How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?
Large foundation models have revolutionized the field, yet challenges remain in optimizing multi-modal models for specialized visual tasks. We propose a novel, generalizable methodology to identify preferred image distributions for black-box Vision-Language Models (VLMs) by measuring output consistency across varied input prompts. Applying this to different rendering types of 3D objects, we demonstrate its efficacy across various domains requiring precise interpretation of complex structures, with a focus on Computer-Aided Design (CAD) as an exemplar field. We further refine VLM outputs using in-context learning with human feedback, significantly enhancing explanation quality. To address the lack of benchmarks in specialized domains, we introduce CAD-VQA, a new dataset for evaluating VLMs on CAD-related visual question answering tasks. Our evaluation of state-of-the-art VLMs on CAD-VQA establishes baseline performance levels, providing a framework for advancing VLM capabilities in complex visual reasoning tasks across various fields requiring expert-level visual interpretation. We release the dataset and evaluation codes at \url{https://github.com/asgsaeid/cad_vqa}.
☆ NoiseAttack: An Evasive Sample-Specific Multi-Targeted Backdoor Attack Through White Gaussian Noise
Backdoor attacks pose a significant threat when using third-party data for deep learning development. In these attacks, data can be manipulated to cause a trained model to behave improperly when a specific trigger pattern is applied, providing the adversary with unauthorized advantages. While most existing works focus on designing trigger patterns in both visible and invisible to poison the victim class, they typically result in a single targeted class upon the success of the backdoor attack, meaning that the victim class can only be converted to another class based on the adversary predefined value. In this paper, we address this issue by introducing a novel sample-specific multi-targeted backdoor attack, namely NoiseAttack. Specifically, we adopt White Gaussian Noise (WGN) with various Power Spectral Densities (PSD) as our underlying triggers, coupled with a unique training strategy to execute the backdoor attack. This work is the first of its kind to launch a vision backdoor attack with the intent to generate multiple targeted classes with minimal input configuration. Furthermore, our extensive experimental results demonstrate that NoiseAttack can achieve a high attack success rate against popular network architectures and datasets, as well as bypass state-of-the-art backdoor detection methods. Our source code and experiments are available at https://github.com/SiSL-URI/NoiseAttack/tree/main.
☆ A Novel Audio-Visual Information Fusion System for Mental Disorders Detection
Mental disorders are among the foremost contributors to the global healthcare challenge. Research indicates that timely diagnosis and intervention are vital in treating various mental disorders. However, the early somatization symptoms of certain mental disorders may not be immediately evident, often resulting in their oversight and misdiagnosis. Additionally, the traditional diagnosis methods incur high time and cost. Deep learning methods based on fMRI and EEG have improved the efficiency of the mental disorder detection process. However, the cost of the equipment and trained staff are generally huge. Moreover, most systems are only trained for a specific mental disorder and are not general-purpose. Recently, physiological studies have shown that there are some speech and facial-related symptoms in a few mental disorders (e.g., depression and ADHD). In this paper, we focus on the emotional expression features of mental disorders and introduce a multimodal mental disorder diagnosis system based on audio-visual information input. Our proposed system is based on spatial-temporal attention networks and innovative uses a less computationally intensive pre-train audio recognition network to fine-tune the video recognition module for better results. We also apply the unified system for multiple mental disorders (ADHD and depression) for the first time. The proposed system achieves over 80\% accuracy on the real multimodal ADHD dataset and achieves state-of-the-art results on the depression dataset AVEC 2014.
comment: 27th International Conference on Information (FUSION)
☆ What makes a face looks like a hat: Decoupling low-level and high-level Visual Properties with Image Triplets ECCV2024
In visual decision making, high-level features, such as object categories, have a strong influence on choice. However, the impact of low-level features on behavior is less understood partly due to the high correlation between high- and low-level features in the stimuli presented (e.g., objects of the same category are more likely to share low-level features). To disentangle these effects, we propose a method that de-correlates low- and high-level visual properties in a novel set of stimuli. Our method uses two Convolutional Neural Networks (CNNs) as candidate models of the ventral visual stream: the CORnet-S that has high neural predictivity in high-level, IT-like responses and the VGG-16 that has high neural predictivity in low-level responses. Triplets (root, image1, image2) of stimuli are parametrized by the level of low- and high-level similarity of images extracted from the different layers. These stimuli are then used in a decision-making task where participants are tasked to choose the most similar-to-the-root image. We found that different networks show differing abilities to predict the effects of low-versus-high-level similarity: while CORnet-S outperforms VGG-16 in explaining human choices based on high-level similarity, VGG-16 outperforms CORnet-S in explaining human choices based on low-level similarity. Using Brain-Score, we observed that the behavioral prediction abilities of different layers of these networks qualitatively corresponded to their ability to explain neural activity at different levels of the visual hierarchy. In summary, our algorithm for stimulus set generation enables the study of how different representations in the visual stream affect high-level cognitive behaviors.
comment: Accepted at Workshop on Human-inspired Computer Vision @ ECCV2024
☆ EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision
Estimating touch contact and pressure in egocentric vision is a central task for downstream applications in Augmented Reality, Virtual Reality, as well as many robotic applications, because it provides precise physical insights into hand-object interaction and object manipulation. However, existing contact pressure datasets lack egocentric views and hand poses, which are essential for accurate estimation during in-situ operation, both for AR/VR interaction and robotic manipulation. In this paper, we introduce EgoPressure,a novel dataset of touch contact and pressure interaction from an egocentric perspective, complemented with hand pose meshes and fine-grained pressure intensities for each contact. The hand poses in our dataset are optimized using our proposed multi-view sequence-based method that processes footage from our capture rig of 8 accurately calibrated RGBD cameras. EgoPressure comprises 5.0 hours of touch contact and pressure interaction from 21 participants captured by a moving egocentric camera and 7 stationary Kinect cameras, which provided RGB images and depth maps at 30 Hz. In addition, we provide baselines for estimating pressure with different modalities, which will enable future developments and benchmarking on the dataset. Overall, we demonstrate that pressure and hand poses are complementary, which supports our intention to better facilitate the physical understanding of hand-object interactions in AR/VR and robotics research.
☆ Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling
This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we found that multilingualism does not affect the bias in this VGS model similarly to what is observed in children.
comment: PhD Dissertation
☆ Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection, Removal, and Generation in the Era of Deep Learning
Shadows are formed when light encounters obstacles, leading to areas of diminished illumination. In computer vision, shadow detection, removal, and generation are crucial for enhancing scene understanding, refining image quality, ensuring visual consistency in video editing, and improving virtual environments. This paper presents a comprehensive survey of shadow detection, removal, and generation in images and videos within the deep learning landscape over the past decade, covering tasks, deep models, datasets, and evaluation metrics. Our key contributions include a comprehensive survey of shadow analysis, standardization of experimental comparisons, exploration of the relationships among model size, speed, and performance, a cross-dataset generalization study, identification of open issues and future directions, and provision of publicly available resources to support further research.
comment: Publicly available results, trained models, and evaluation metrics at https://github.com/xw-hu/Unveiling-Deep-Shadows
☆ DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction
Reconstructing scenes and tracking motion are two sides of the same coin. Tracking points allow for geometric reconstruction [14], while geometric reconstruction of (dynamic) scenes allows for 3D tracking of points over time [24, 39]. The latter was recently also exploited for 2D point tracking to overcome occlusion ambiguities by lifting tracking directly into 3D [38]. However, above approaches either require offline processing or multi-view camera setups both unrealistic for real-world applications like robot navigation or mixed reality. We target the challenge of online 2D and 3D point tracking from unposed monocular camera input introducing Dynamic Online Monocular Reconstruction (DynOMo). We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion. Our approach extends 3D Gaussians to capture new content and object motions while estimating camera movements from a single RGB frame. DynOMo stands out by enabling emergence of point trajectories through robust image feature reconstruction and a novel similarity-enhanced regularization term, without requiring any correspondence-level supervision. It sets the first baseline for online point tracking with monocular unposed cameras, achieving performance on par with existing methods. We aim to inspire the community to advance online point tracking and reconstruction, expanding the applicability to diverse real-world scenarios.
☆ Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models ECCV 2024
This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework employing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach involves assessing image clearness and providing semantics using vision-language models on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real-world data, utilizing a dual-step strategy with pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art works.
comment: Accepted by ECCV 2024
☆ LinFusion: 1 GPU, 1 Minute, 16K Image
Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features-attention normalization and non-causal inference-that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images like 16K resolution. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Codes are available at https://github.com/Huage001/LinFusion.
comment: Work in Progress. Codes are available at https://github.com/Huage001/LinFusion
☆ DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter achieves generalization ability to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy with the compiled paired video-depth datasets. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
comment: Project webpage: https://depthcrafter.github.io
☆ GraspSplats: Efficient Manipulation with 3D Feature Splatting
The ability for robots to perform efficient and zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap for representations to support such a capability, existing methods rely on neural fields (NeRFs) via differentiable rendering or point-based projection methods. However, we demonstrate that NeRFs are inappropriate for scene changes due to their implicitness and point-based methods are inaccurate for part localization without rendering-based optimization. To amend these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods like F3RM and LERF-TOGO, and 2D detection methods.
comment: Project webpage: https://graspsplats.github.io/
☆ Physical Rule-Guided Convolutional Neural Network
The black-box nature of Convolutional Neural Networks (CNNs) and their reliance on large datasets limit their use in complex domains with limited labeled data. Physics-Guided Neural Networks (PGNNs) have emerged to address these limitations by integrating scientific principles and real-world knowledge, enhancing model interpretability and efficiency. This paper proposes a novel Physics-Guided CNN (PGCNN) architecture that incorporates dynamic, trainable, and automated LLM-generated, widely recognized rules integrated into the model as custom layers to address challenges like limited data and low confidence scores. The PGCNN is evaluated on multiple datasets, demonstrating superior performance compared to a baseline CNN model. Key improvements include a significant reduction in false positives and enhanced confidence scores for true detection. The results highlight the potential of PGCNNs to improve CNN performance for broader application areas.
♻ ☆ Open-vocabulary Temporal Action Localization using VLMs
Video action localization aims to find timings of a specific action from a long video. Although existing learning-based approaches have been successful, those require annotating videos that come with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLM). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess a frame that is considered to be closest to the start/end of the action. Iterating this process by narrowing a sampling time window results in finding a specific frame of start and end of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos. A sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/.
comment: 7 pages, 5 figures, 4 tables. Last updated on September 3rd, 2024
♻ ☆ A Multiscale Gradient Fusion Method for Edge Detection in Color Images Utilizing the CBM3D Filter
In this paper, a color edge detection strategy based on collaborative filtering combined with multiscale gradient fusion is proposed. The block-matching and 3D (BM3D) filter are used to enhance the sparse representation in the transform domain and achieve the effect of denoising, whereas the multiscale gradient fusion makes up for the defect of loss of details in single-scale edge detection and improves the edge detection resolution and quality. First, the RGB images in the dataset are converted to XYZ color space images through mathematical operations. Second, the colored block-matching and 3D (CBM3D) filter are used on the sparse images and to remove noise interference. Then, the vector gradients of the color image and the anisotropic Gaussian directional derivative of the two scale parameters are calculated and averaged pixel-by-pixel to obtain a new edge strength map. Finally, the edge features are enhanced by image normalization and non-maximum suppression technology, and on that basis, the edge contour is obtained by double threshold selection and a new morphological refinement method. Through an experimental analysis of the edge detection dataset, the method proposed has good noise robustness and high edge quality, which is better than the Color Sobel, Color Canny, SE and Color AGDD as shown by the PR curve, AUC, PSNR, MSE, and FOM indicators.
comment: 1 figure, 2 tables
♻ ☆ IBO: Inpainting-Based Occlusion to Enhance Explainable Artificial Intelligence Evaluation in Histopathology
Histopathological image analysis is crucial for accurate cancer diagnosis and treatment planning. While deep learning models, especially convolutional neural networks, have advanced this field, their "black-box" nature raises concerns about interpretability and trustworthiness. Explainable Artificial Intelligence (XAI) techniques aim to address these concerns, but evaluating their effectiveness remains challenging. A significant issue with current occlusion-based XAI methods is that they often generate Out-of-Distribution (OoD) samples, leading to inaccurate evaluations. In this paper, we introduce Inpainting-Based Occlusion (IBO), a novel occlusion strategy that utilizes a Denoising Diffusion Probabilistic Model to inpaint occluded regions in histopathological images. By replacing cancerous areas with realistic, non-cancerous tissue, IBO minimizes OoD artifacts and preserves data integrity. We evaluate our method on the CAMELYON16 dataset through two phases: first, by assessing perceptual similarity using the Learned Perceptual Image Patch Similarity (LPIPS) metric, and second, by quantifying the impact on model predictions through Area Under the Curve (AUC) analysis. Our results demonstrate that IBO significantly improves perceptual fidelity, achieving nearly twice the improvement in LPIPS scores compared to the best existing occlusion strategy. Additionally, IBO increased the precision of XAI performance prediction from 42% to 71% compared to traditional methods. These results demonstrate IBO's potential to provide more reliable evaluations of XAI techniques, benefiting histopathology and other applications. The source code for this study is available at https://github.com/a-fsh-r/IBO.
comment: 19 pages, 6 figures
♻ ☆ Learning a Generalized Physical Face Model From Data
Physically-based simulation is a powerful approach for 3D facial animation as the resulting deformations are governed by physical constraints, allowing to easily resolve self-collisions, respond to external forces and perform realistic anatomy edits. Today's methods are data-driven, where the actuations for finite elements are inferred from captured skin geometry. Unfortunately, these approaches have not been widely adopted due to the complexity of initializing the material space and learning the deformation model for each character separately, which often requires a skilled artist followed by lengthy network training. In this work, we aim to make physics-based facial animation more accessible by proposing a generalized physical face model that we learn from a large 3D face dataset. Once trained, our model can be quickly fit to any unseen identity and produce a ready-to-animate physical face model automatically. Fitting is as easy as providing a single 3D face scan, or even a single face image. After fitting, we offer intuitive animation controls, as well as the ability to retarget animations across characters. All the while, the resulting animations allow for physical effects like collision avoidance, gravity, paralysis, bone reshaping and more.
♻ ☆ $OC^4-ReID$: Occluded Cloth-Changing Person Re-Identification
The study of Cloth-Changing Person Re-identification (CC-ReID) focuses on retrieving specific pedestrians when their clothing has changed, typically under the assumption that the entire pedestrian images are visible. Pedestrian images in real-world scenarios, however, are often partially obscured by obstacles, presenting a significant challenge to existing CC-ReID systems. In this paper, we introduce a more challenging task termed Occluded Cloth-Changing Person Re-Identification ($OC^4-ReID$), which simultaneously addresses two challenges of clothing changes and occlusion. Concretely, we construct two new datasets, Occ-LTCC and Occ-PRCC, based on original CC-ReID datasets to include random occlusions of key pedestrians components (e.g., head, torso). Moreover, a novel benchmark is proposed for $OC^4-ReID$ incorporating a Train-Test Micro Granularity Screening ($T^2MGS$) module to mitigate the influence of occlusion and proposing a Part-Robust Triplet (PRT) loss for partial features learning. Comprehensive experiments on the proposed datasets, as well as on two CC-ReID benchmark datasets demonstrate the superior performance of proposed method against other state-of-the-art methods. The codes and datasets are available at: https://github.com/1024AILab/OC4-ReID.
♻ ☆ Restorer: Removing Multi-Degradation with All-Axis Attention and Prompt Guidance
There are many excellent solutions in image restoration.However, most methods require on training separate models to restore images with different types of degradation.Although existing all-in-one models effectively address multiple types of degradation simultaneously, their performance in real-world scenarios is still constrained by the task confusion problem.In this work, we attempt to address this issue by introducing \textbf{Restorer}, a novel Transformer-based all-in-one image restoration model.To effectively address the complex degradation present in real-world images, we propose All-Axis Attention (AAA), a mechanism that simultaneously models long-range dependencies across both spatial and channel dimensions, capturing potential correlations along all axes.Additionally, we introduce textual prompts in Restorer to incorporate explicit task priors, enabling the removal of specific degradation types based on user instructions. By iterating over these prompts, Restorer can handle composite degradation in real-world scenarios without requiring additional training.Based on these designs, Restorer with one set of parameters demonstrates state-of-the-art performance in multiple image restoration tasks compared to existing all-in-one and even single-task models.Additionally, Restorer is efficient during inference, suggesting the potential in real-world applications.
♻ ☆ On the Federated Learning Framework for Cooperative Perception
Cooperative perception is essential to enhance the efficiency and safety of future transportation systems, requiring extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling data privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for CP, termed the federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Utilizing the BEV transformer as the primary model, our rigorous testing on the OpenV2V dataset, augmented with FedBEVT data, demonstrates significant improvements in the average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of environmental perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.
comment: accepted by IEEE RA-L
♻ ☆ 3DGS.zip: A survey on 3D Gaussian Splatting Compression Methods
We present a work-in-progress survey on 3D Gaussian Splatting compression methods, focusing on their statistical performance across various benchmarks. This survey aims to facilitate comparability by summarizing key statistics of different compression approaches in a tabulated format. The datasets evaluated include TanksAndTemples, MipNeRF360, DeepBlending, and SyntheticNeRF. For each method, we report the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and the resultant size in megabytes (MB), as provided by the respective authors. This is an ongoing, open project, and we invite contributions from the research community as GitHub issues or pull requests. Please visit http://w-m.github.io/3dgs-compression-survey/ for more information and a sortable version of the table.
comment: 3D Gaussian Splatting compression survey; 3DGS compression; new approaches added
♻ ☆ SUMix: Mixup with Semantic and Uncertain Information ECCV2024
Mixup data augmentation approaches have been applied for various tasks of deep learning to improve the generalization ability of deep neural networks. Some existing approaches CutMix, SaliencyMix, etc. randomly replace a patch in one image with patches from another to generate the mixed image. Similarly, the corresponding labels are linearly combined by a fixed ratio $\lambda$ by l. The objects in two images may be overlapped during the mixing process, so some semantic information is corrupted in the mixed samples. In this case, the mixed image does not match the mixed label information. Besides, such a label may mislead the deep learning model training, which results in poor performance. To solve this problem, we proposed a novel approach named SUMix to learn the mixing ratio as well as the uncertainty for the mixed samples during the training process. First, we design a learnable similarity function to compute an accurate mix ratio. Second, an approach is investigated as a regularized term to model the uncertainty of the mixed samples. We conduct experiments on five image benchmarks, and extensive experimental results imply that our method is capable of improving the performance of classifiers with different cutting-based mixup approaches. The source code is available at https://github.com/JinXins/SUMix.
comment: Accepted by ECCV2024 [Camera Ready] (19 pages, 7 figures) with the source code at https://github.com/JinXins/SUMix
♻ ☆ SPIdepth: Strengthened Pose Information for Self-supervised Monocular Depth Estimation
Self-supervised monocular depth estimation has garnered considerable attention for its applications in autonomous driving and robotics. While recent methods have made strides in leveraging techniques like the Self Query Layer (SQL) to infer depth from motion, they often overlook the potential of strengthening pose information. In this paper, we introduce SPIdepth, a novel approach that prioritizes enhancing the pose network for improved depth estimation. Building upon the foundation laid by SQL, SPIdepth emphasizes the importance of pose information in capturing fine-grained scene structures. By enhancing the pose network's capabilities, SPIdepth achieves remarkable advancements in scene understanding and depth estimation. Experimental results on benchmark datasets such as KITTI, Cityscapes, and Make3D showcase SPIdepth's state-of-the-art performance, surpassing previous methods by significant margins. Specifically, SPIdepth tops the self-supervised KITTI benchmark. Additionally, SPIdepth achieves the lowest AbsRel (0.029), SqRel (0.069), and RMSE (1.394) on KITTI, establishing new state-of-the-art results. On Cityscapes, SPIdepth shows improvements over SQLdepth of 21.7% in AbsRel, 36.8% in SqRel, and 16.5% in RMSE, even without using motion masks. On Make3D, SPIdepth in zero-shot outperforms all other models. Remarkably, SPIdepth achieves these results using only a single image for inference, surpassing even methods that utilize video sequences for inference, thus demonstrating its efficacy and efficiency in real-world applications. Our approach represents a significant leap forward in self-supervised monocular depth estimation, underscoring the importance of strengthening pose information for advancing scene understanding in real-world applications. The code and pre-trained models are publicly available at https://github.com/Lavreniuk/SPIdepth.
♻ ☆ Realigned Softmax Warping for Deep Metric Learning
Deep Metric Learning (DML) loss functions traditionally aim to control the forces of separability and compactness within an embedding space so that the same class data points are pulled together and different class ones are pushed apart. Within the context of DML, a softmax operation will typically normalize distances into a probability for optimization, thus coupling all the push/pull forces together. This paper proposes a potential new class of loss functions that operate within a euclidean domain and aim to take full advantage of the coupled forces governing embedding space formation under a softmax. These forces of compactness and separability can be boosted or mitigated within controlled locations at will by using a warping function. In this work, we provide a simple example of a warping function and use it to achieve competitive, state-of-the-art results on various metric learning benchmarks.
comment: Preprint
♻ ☆ Progressive Domain Adaptation for Thermal Infrared Object Tracking
Due to the lack of large-scale labeled Thermal InfraRed (TIR) training datasets, most existing TIR trackers are trained directly on RGB datasets. However, tracking methods trained on RGB datasets suffer a significant drop-off in TIR data due to the domain shift issue. To this end, in this work, we propose a Progressive Domain Adaptation framework for TIR Tracking (PDAT), which transfers useful knowledge learned from RGB tracking to TIR tracking. The framework makes full use of large-scale labeled RGB datasets without requiring time-consuming and labor-intensive labeling of large-scale TIR data. Specifically, we first propose an adversarial-based global domain adaptation module to reduce domain gap on the feature level coarsely. Second, we design a clustering-based subdomain adaptation method to further align the feature distributions of the RGB and TIR datasets finely. These two domain adaptation modules gradually eliminate the discrepancy between the two domains, and thus learn domain-invariant fine-grained features through progressive training. Additionally, we collect a largescale TIR dataset with over 1.48 million unlabeled TIR images for training the proposed domain adaptation framework. Experimental results on five TIR tracking benchmarks show that the proposed method gains a nearly 6% success rate, demonstrating its effectiveness.
comment: 10 pages, 8 figures
♻ ☆ An Efficient Instance Segmentation Framework Using Segmentation Foundation Models with Oriented Bounding Box Prompts
Instance segmentation in unmanned aerial vehicle measurement is a long-standing challenge. Since horizontal bounding boxes introduce many interference objects, oriented bounding boxes (OBBs) are usually used for instance identification. However, based on ``segmentation within bounding box'' paradigm, current instance segmentation methods using OBBs are overly dependent on bounding box detection performance. To tackle this, this paper proposes OBSeg, an efficient instance segmentation framework using OBBs. OBSeg is based on box prompt-based segmentation foundation models (BSMs), e.g., Segment Anything Model. Specifically, OBSeg first detects OBBs to distinguish instances and provide coarse localization information. Then, it predicts OBB prompt-related masks for fine segmentation. Since OBBs only serve as prompts, OBSeg alleviates the over-dependence on bounding box detection performance of current instance segmentation methods using OBBs. In addition, to enable BSMs to handle OBB prompts, we propose a novel OBB prompt encoder. To make OBSeg more lightweight and further improve the performance of lightweight distilled BSMs, a Gaussian smoothing-based knowledge distillation method is introduced. Experiments demonstrate that OBSeg outperforms current instance segmentation methods on multiple public datasets. The code is available at https://github.com/zhen6618/OBBInstanceSegmentation.
♻ ☆ Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations
Composed image retrieval extends content-based image retrieval systems by enabling users to search using reference images and captions that describe their intention. Despite great progress in developing image-text compositors to extract discriminative visual-linguistic features, we identify a hitherto overlooked issue, triplet ambiguity, which impedes robust feature extraction. Triplet ambiguity refers to a type of semantic ambiguity that arises between the reference image, the relative caption, and the target image. It is mainly due to the limited representation of the annotated text, resulting in many noisy triplets where multiple visually dissimilar candidate images can be matched to an identical reference pair (i.e., a reference image + a relative caption). To address this challenge, we propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals. Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings, fostering complementary feature extraction and mitigating dependence on any single, potentially biased compositor; (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions to promote consensual outputs. During evaluation, the decisions of the four compositors are combined through a weighting scheme, enhancing overall agreement. On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and 6.67% boost in R@50, underscoring its competitiveness in addressing the fundamental limitations of existing methods.
comment: Accepted by Knowledge-Based Systems (KBS)
♻ ☆ CAST: Cross-Attention in Space and Time for Video Action Recognition NeurIPS 2023
Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
comment: This is an accepted NeurIPS 2023. Project webpage is available at https://jong980812.github.io/CAST.github.io/ Code is available at https://github.com/KHU-VLL/CAST
♻ ☆ TALDS-Net: Task-Aware Adaptive Local Descriptors Selection for Few-shot Image Classification ICASSP 2024
Few-shot image classification aims to classify images from unseen novel classes with few samples. Recent works demonstrate that deep local descriptors exhibit enhanced representational capabilities compared to image-level features. However, most existing methods solely rely on either employing all local descriptors or directly utilizing partial descriptors, potentially resulting in the loss of crucial information. Moreover, these methods primarily emphasize the selection of query descriptors while overlooking support descriptors. In this paper, we propose a novel Task-Aware Adaptive Local Descriptors Selection Network (TALDS-Net), which exhibits the capacity for adaptive selection of task-aware support descriptors and query descriptors. Specifically, we compare the similarity of each local support descriptor with other local support descriptors to obtain the optimal support descriptor subset and then compare the query descriptors with the optimal support subset to obtain discriminative query descriptors. Extensive experiments demonstrate that our TALDS-Net outperforms state-of-the-art methods on both general and fine-grained datasets.
comment: 4 pages, 1 figures, is accepted by ICASSP 2024
♻ ☆ Asynchronous Blob Tracker for Event Cameras
Event-based cameras are popular for tracking fast-moving objects due to their high temporal resolution, low latency, and high dynamic range. In this paper, we propose a novel algorithm for tracking event blobs using raw events asynchronously in real time. We introduce the concept of an event blob as a spatio-temporal likelihood of event occurrence where the conditional spatial likelihood is blob-like. Many real-world objects such as car headlights or any quickly moving foreground objects generate event blob data. The proposed algorithm uses a nearest neighbour classifier with a dynamic threshold criteria for data association coupled with an extended Kalman filter to track the event blob state. Our algorithm achieves highly accurate blob tracking, velocity estimation, and shape estimation even under challenging lighting conditions and high-speed motions (> 11000 pixels/s). The microsecond time resolution achieved means that the filter output can be used to derive secondary information such as time-to-contact or range estimation, that will enable applications to real-world problems such as collision avoidance in autonomous driving.
comment: 18 pages, 16 figures, Manuscript was accepted on August 7, 2024, by IEEE Transactions on Robotics
♻ ☆ Rethinking Barely-Supervised Volumetric Medical Image Segmentation from an Unsupervised Domain Adaptation Perspective
This paper investigates an extremely challenging problem: barely-supervised volumetric medical image segmentation (BSS). A BSS training dataset consists of two parts: 1) a barely-annotated labeled set, where each labeled image contains only a single-slice annotation, and 2) an unlabeled set comprising numerous unlabeled volumetric images. State-of-the-art BSS methods employ a registration-based paradigm, which uses inter-slice image registration to propagate single-slice annotations into volumetric pseudo labels, constructing a completely annotated labeled set, to which a semi-supervised segmentation scheme can be applied. However, the paradigm has a critical limitation: the pseudo-labels generated by image registration are unreliable and noisy. Motivated by this, we propose a new perspective: instead of solving BSS within a semi-supervised learning scheme, this work formulates BSS as an unsupervised domain adaptation problem. To this end, we propose a novel BSS framework, \textbf{B}arely-supervised learning \textbf{via} unsupervised domain \textbf{A}daptation (BvA), as an alternative to the dominant registration paradigm. Specifically, we first design a novel noise-free labeled data construction algorithm (NFC) for slice-to-volume labeled data synthesis. Then, we introduce a frequency and spatial Mix-Up strategy (FSX) to mitigate the domain shifts. Extensive experiments demonstrate that our method provides a promising alternative for BSS. Remarkably, the proposed method, trained on the left atrial segmentation dataset with \textbf{only one} barely-labeled image, achieves a Dice score of 81.20%, outperforming the state-of-the-art by 61.71%. The code is available at \href{https://github.com/Senyh/BvA}{\textit{\texttt{https://github.com/Senyh/BvA}}}.
♻ ☆ Planning and Rendering: Towards Product Poster Generation with Diffusion Models
Product poster generation significantly optimizes design efficiency and reduces production costs. Prevailing methods predominantly rely on image-inpainting methods to generate clean background images for given products. Subsequently, poster layout generation methods are employed to produce corresponding layout results. However, the background images may not be suitable for accommodating textual content due to their complexity, and the fixed location of products limits the diversity of layout results. To alleviate these issues, we propose a novel product poster generation framework based on diffusion models named P\&R. The P\&R draws inspiration from the workflow of designers in creating posters, which consists of two stages: Planning and Rendering. At the planning stage, we propose a PlanNet to generate the layout of the product and other visual components considering both the appearance features of the product and semantic features of the text, which improves the diversity and rationality of the layouts. At the rendering stage, we propose a RenderNet to generate the background for the product while considering the generated layout, where a spatial fusion module is introduced to fuse the layout of different visual components. To foster the advancement of this field, we propose the first product poster generation dataset PPG30k, comprising 30k exquisite product poster images along with comprehensive image and text annotations. Our method outperforms the state-of-the-art product poster generation methods on PPG30k. The PPG30k will be released soon.
♻ ☆ Unveiling the Human-like Similarities of Automatic Facial Expression Recognition: An Empirical Exploration through Explainable AI
Facial expression recognition is vital for human behavior analysis, and deep learning has enabled models that can outperform humans. However, it is unclear how closely they mimic human processing. This study aims to explore the similarity between deep neural networks and human perception by comparing twelve different networks, including both general object classifiers and FER-specific models. We employ an innovative global explainable AI method to generate heatmaps, revealing crucial facial regions for the twelve networks trained on six facial expressions. We assess these results both quantitatively and qualitatively, comparing them to ground truth masks based on Friesen and Ekman's description and among them. We use Intersection over Union (IoU) and normalized correlation coefficients for comparisons. We generate 72 heatmaps to highlight critical regions for each expression and architecture. Qualitatively, models with pre-trained weights show more similarity in heatmaps compared to those without pre-training. Specifically, eye and nose areas influence certain facial expressions, while the mouth is consistently important across all models and expressions. Quantitatively, we find low average IoU values (avg. 0.2702) across all expressions and architectures. The best-performing architecture averages 0.3269, while the worst-performing one averages 0.2066. Dendrograms, built with the normalized correlation coefficient, reveal two main clusters for most expressions: models with pre-training and models without pre-training. Findings suggest limited alignment between human and AI facial expression recognition, with network architectures influencing the similarity, as similar architectures prioritize similar facial regions.
comment: Multimed Tools Appl (2024)
♻ ☆ Learning Exposure Correction in Dynamic Scenes
Exposure correction aims to enhance visual data suffering from improper exposures, which can greatly improve satisfactory visual effects. However, previous methods mainly focus on the image modality, and the video counterpart is less explored in the literature. Directly applying prior image-based methods to videos results in temporal incoherence with low visual quality. Through thorough investigation, we find that the development of relevant communities is limited by the absence of a benchmark dataset. Therefore, in this paper, we construct the first real-world paired video dataset, including both underexposure and overexposure dynamic scenes. To achieve spatial alignment, we utilize two DSLR cameras and a beam splitter to simultaneously capture improper and normal exposure videos. Additionally, we propose an end-to-end video exposure correction network, in which a dual-stream module is designed to deal with both underexposure and overexposure factors, enhancing the illumination based on Retinex theory. The extensive experiments based on various metrics and user studies demonstrate the significance of our dataset and the effectiveness of our method. The code and dataset are available at https://github.com/kravrolens/VECNet.
comment: To be published at ACM Multimedia 2024
♻ ☆ Deep Learning for Computer Vision based Activity Recognition and Fall Detection of the Elderly: a Systematic Review
As the percentage of elderly people in developed countries increases worldwide, the healthcare of this collective is a worrying matter, especially if it includes the preservation of their autonomy. In this direction, many studies are being published on Ambient Assisted Living (AAL) systems, which help to reduce the preoccupations raised by the independent living of the elderly. In this study, a systematic review of the literature is presented on fall detection and Human Activity Recognition (HAR) for the elderly, as the two main tasks to solve to guarantee the safety of elderly people living alone. To address the current tendency to perform these two tasks, the review focuses on the use of Deep Learning (DL) based approaches on computer vision data. In addition, different collections of data like DL models, datasets or hardware (e.g. depth or thermal cameras) are gathered from the reviewed studies and provided for reference in future studies. Strengths and weaknesses of existing approaches are also discussed and, based on them, our recommendations for future works are provided.
♻ ☆ RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation
The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it lacks proficiency in referring video object segmentation (RVOS) due to the need for precise user-interactive prompts and a limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which explores the potential of SAM for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps in an online manner. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, serving as user-interactive prompts. Additionally, we have introduced the hierarchical dense attention module to fuse hierarchical visual semantic information with sparse embeddings to obtain fine-grained dense embeddings, and an implicit tracking module to generate a tracking token and provide historical information for the mask decoder. Furthermore, we employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively. Through comprehensive ablation studies, we demonstrate our model's practical and effective design choices. Extensive experiments conducted on Refer-Youtube-VOS, Ref-DAVIS17, and three referring image segmentation datasets validate the superiority and effectiveness of our RefSAM model over existing methods.
♻ ☆ PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning
Transformers have revolutionized the point cloud learning task, but the quadratic complexity hinders its extension to long sequence and makes a burden on limited computational resources. The recent advent of RWKV, a fresh breed of deep sequence models, has shown immense potential for sequence modeling in NLP tasks. In this paper, we present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field with necessary modifications for point cloud learning tasks. Specifically, taking the embedded point patches as input, we first propose to explore the global processing capabilities within PointRWKV blocks using modified multi-headed matrix-valued states and a dynamic attention recurrence mechanism. To extract local geometric features simultaneously, we design a parallel branch to encode the point cloud efficiently in a fixed radius near-neighbors graph with a graph stabilizer. Furthermore, we design PointRWKV as a multi-scale framework for hierarchical feature learning of 3D point clouds, facilitating various downstream tasks. Extensive experiments on different point cloud learning tasks show our proposed PointRWKV outperforms the transformer- and mamba-based counterparts, while significantly saving about 42\% FLOPs, demonstrating the potential option for constructing foundational 3D models.
♻ ☆ White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
comment: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes and formatting improvements. V3: improvements from journal revisions
♻ ☆ Correlation-Embedded Transformer Tracking: A Single-Branch Framework
Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by the Siamese-like networks are often insufficient to model the tracked targets and distractor objects, thereby hindering them from being robust and discriminative simultaneously. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike the Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it can suppress non-target features, resulting in target-aware feature extraction. The output features can be directly used for predicting target locations without additional correlation steps. Thus, we reformulate the two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After conducting an in-depth analysis of the SBT baseline, we summarize many effective design principles and propose an improved tracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer to enhance shallow-level features. A unified relation modeling is proposed to remove complex handcrafted layer pattern designs. SuperSBT is further improved by masked image modeling pre-training, integrating temporal modeling, and equipping with dedicated prediction heads. Thus, SuperSBT outperforms the SBT baseline by 4.7%,3.0%, and 4.5% AUC scores in LaSOT, TrackingNet, and GOT-10K. Notably, SuperSBT greatly raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight VOT benchmarks.
comment: Extension of SBT paper, accepted by TPAMI
♻ ☆ Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation ECCV 2024
Current weakly-supervised incremental learning for semantic segmentation (WILSS) approaches only consider replacing pixel-level annotations with image-level labels, while the training images are still from well-designed datasets. In this work, we argue that widely available web images can also be considered for the learning of new classes. To achieve this, firstly we introduce a strategy to select web images which are similar to previously seen examples in the latent space using a Fourier-based domain discriminator. Then, an effective caption-driven reharsal strategy is proposed to preserve previously learnt classes. To our knowledge, this is the first work to rely solely on web images for both the learning of new concepts and the preservation of the already learned ones in WILSS. Experimental results show that the proposed approach can reach state-of-the-art performances without using manually selected and annotated data in the incremental steps.
comment: ECCV 2024
♻ ☆ Image-Based Virtual Try-On: A Survey
Image-based virtual try-on aims to synthesize a naturally dressed person image with a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potential. However, there is a gap between current research progress and commercial applications and an absence of comprehensive overview of this field to accelerate the development.In this survey, we provide a comprehensive analysis of the state-of-the-art techniques and methodologies in aspects of pipeline architecture, person representation and key modules such as try-on indication, clothing warping and try-on stage. We additionally apply CLIP to assess the semantic alignment of try-on results, and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset.In addition to quantitative and qualitative evaluation of current open-source methods, unresolved issues are highlighted and future research directions are prospected to identify key trends and inspire further exploration. The uniformly implemented evaluation metrics, dataset and collected methods will be made public available at https://github.com/little-misfit/Survey-Of-Virtual-Try-On.
comment: 30 pages, 20 figures
♻ ☆ DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception capability, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. We introduce an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, we propose a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to adaptively create a more compact visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks and the effectiveness of each component.
♻ ☆ Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers
Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data on scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach outperforms prior arts consistently on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.
♻ ☆ Learn Suspected Anomalies from Event Prompts for Video Anomaly Detection
Most models for weakly supervised video anomaly detection (WS-VAD) rely on multiple instance learning, aiming to distinguish normal and abnormal snippets without specifying the type of anomaly. However, the ambiguous nature of anomaly definitions across contexts may introduce inaccuracy in discriminating abnormal and normal events. To show the model what is anomalous, a novel framework is proposed to guide the learning of suspected anomalies from event prompts. Given a textual prompt dictionary of potential anomaly events and the captions generated from anomaly videos, the semantic anomaly similarity between them could be calculated to identify the suspected events for each video snippet. It enables a new multi-prompt learning process to constrain the visual-semantic features across all videos, as well as provides a new way to label pseudo anomalies for self-training. To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on four datasets, namely XD-Violence, UCF-Crime, TAD, and ShanghaiTech. Our proposed model outperforms most state-of-the-art methods in terms of AP or AUC (86.5\%, \hl{90.4}\%, 94.4\%, and 97.4\%). Furthermore, it shows promising performance in open-set and cross-dataset cases. The data, code, and models can be found at: \url{https://github.com/shiwoaz/lap}.
♻ ☆ TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. However, existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes, leading to confusion between novel classes and semantically similar ones. In this work, we propose a novel approach, TagCLIP (Trusty-aware guided CLIP), to address this issue. We disentangle the ill-posed optimization problem into two parallel processes: semantic matching performed individually and reliability judgment for improving discrimination ability. Building on the idea of special tokens in language modeling representing sentence-level embeddings, we introduce a trusty token that enables distinguishing novel classes from known ones in prediction. To evaluate our approach, we conduct experiments on two benchmark datasets, PASCAL VOC 2012, COCO-Stuff 164K and PASCAL Context. Our results show that TagCLIP improves the Intersection over Union (IoU) of unseen classes by 7.4%, 1.7% and 2.1%, respectively, with negligible overheads. The code is available at https://github.com/dvlab-research/TagCLIP.
comment: TPAMI2024
♻ ☆ Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach ECCV 2024
In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across the platforms, we also devise the platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and our proposed G2A-VReID dataset.
comment: Published at ECCV 2024
♻ ☆ The Impact of Print-Scanning in Heterogeneous Morph Evaluation Scenarios
Face morphing attacks pose an increasing threat to face recognition (FR) systems. A morphed photo contains biometric information from two different subjects to take advantage of vulnerabilities in FRs. These systems are particularly susceptible to attacks when the morphs are subjected to print-scanning to mask the artifacts generated during the morphing process. We investigate the impact of print-scanning on morphing attack detection through a series of evaluations on heterogeneous morphing attack scenarios. Our experiments show that we can increase the Mated Morph Presentation Match Rate (MMPMR) by up to 8.48%. Furthermore, when a Single-image Morphing Attack Detection (S-MAD) algorithm is not trained to detect print-scanned morphs the Morphing Attack Classification Error Rate (MACER) can increase by up to 96.12%, indicating significant vulnerability.
comment: Accepted as a special sessions paper at IJCB 2024
♻ ☆ AIGCs Confuse AI Too: Investigating and Explaining Synthetic Image-induced Hallucinations in Large Vision-Language Models
The evolution of Artificial Intelligence Generated Contents (AIGCs) is advancing towards higher quality. The growing interactions with AIGCs present a new challenge to the data-driven AI community: While AI-generated contents have played a crucial role in a wide range of AI models, the potential hidden risks they introduce have not been thoroughly examined. Beyond human-oriented forgery detection, AI-generated content poses potential issues for AI models originally designed to process natural data. In this study, we underscore the exacerbated hallucination phenomena in Large Vision-Language Models (LVLMs) caused by AI-synthetic images. Remarkably, our findings shed light on a consistent AIGC \textbf{hallucination bias}: the object hallucinations induced by synthetic images are characterized by a greater quantity and a more uniform position distribution, even these synthetic images do not manifest unrealistic or additional relevant visual features compared to natural images. Moreover, our investigations on Q-former and Linear projector reveal that synthetic images may present token deviations after visual projection, thereby amplifying the hallucination bias.
♻ ☆ Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning
Recently, multi-modal vision-language foundation models have gained significant attention in the medical field. While these models offer great opportunities, they still face crucial challenges, such as the requirement for fine-grained knowledge understanding in computer-aided diagnosis and the capability of utilizing very limited or even no task-specific labeled data in real-world clinical applications. In this study, we present MaCo, a masked contrastive chest X-ray foundation model that tackles these challenges. MaCo explores masked contrastive learning to simultaneously achieve fine-grained image understanding and zero-shot learning for a variety of medical imaging tasks. It designs a correlation weighting mechanism to adjust the correlation between masked chest X-ray image patches and their corresponding reports, thereby enhancing the model's representation learning capabilities. To evaluate the performance of MaCo, we conducted extensive experiments using 6 well-known open-source X-ray datasets. The experimental results demonstrate the superiority of MaCo over 10 state-of-the-art approaches across tasks such as classification, segmentation, detection, and phrase grounding. These findings highlight the significant potential of MaCo in advancing a wide range of medical image analysis tasks.
♻ ☆ BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction
Analyzing and reconstructing visual stimuli from brain signals effectively advances the understanding of human visual system. However, the EEG signals are complex and contain significant noise. This leads to substantial limitations in existing works of visual stimuli reconstruction from EEG, such as difficulties in aligning EEG embeddings with the fine-grained semantic information and a heavy reliance on additional large self-collected dataset for training. To address these challenges, we propose a novel approach called BrainVis. Firstly, we divide the EEG signals into various units and apply a self-supervised approach on them to obtain EEG time-domain features, in an attempt to ease the training difficulty. Additionally, we also propose to utilize the frequency-domain features to enhance the EEG representations. Then, we simultaneously align EEG time-frequency embeddings with the interpolation of the coarse and fine-grained semantics in the CLIP space, to highlight the primary visual components and reduce the cross-modal alignment difficulty. Finally, we adopt the cascaded diffusion models to reconstruct images. Using only 10\% training data of the previous work, our proposed BrainVis outperforms state of the arts in both semantic fidelity reconstruction and generation quality. The code is available at https://github.com/RomGai/BrainVis.
♻ ☆ GISR: Geometric Initialization and Silhouette-based Refinement for Single-View Robot Pose and Configuration Estimation
In autonomous robotics, measurement of the robot's internal state and perception of its environment, including interaction with other agents such as collaborative robots, are essential. Estimating the pose of the robot arm from a single view has the potential to replace classical eye-to-hand calibration approaches and is particularly attractive for online estimation and dynamic environments. In addition to its pose, recovering the robot configuration provides a complete spatial understanding of the observed robot that can be used to anticipate the actions of other agents in advanced robotics use cases. Furthermore, this additional redundancy enables the planning and execution of recovery protocols in case of sensor failures or external disturbances. We introduce GISR - a deep configuration and robot-to-camera pose estimation method that prioritizes execution in real-time. GISR consists of two modules: (i) a geometric initialization module that efficiently computes an approximate robot pose and configuration, and (ii) a deep iterative silhouette-based refinement module that arrives at a final solution in just a few iterations. We evaluate GISR on publicly available data and show that it outperforms existing methods of the same class in terms of both speed and accuracy, and can compete with approaches that rely on ground-truth proprioception and recover only the pose.
comment: IEEE Robotics and Automation Letters (under revision), code available at http://github.com/iwhitey/GISR-robot
♻ ☆ IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection
Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from $10$ U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.
comment: 40 pages
♻ ☆ Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients
We present, QP-SBGD, a novel layer-wise stochastic optimiser tailored towards training neural networks with binary weights, known as binary neural networks (BNNs), on quantum hardware. BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy. However, training them in practice remains to be an open challenge. Most known BNN-optimisers either rely on projected updates or binarise weights post-training. Instead, QP-SBGD approximately maps the gradient onto binary variables, by solving a quadratic constrained binary optimisation. Under practically reasonable assumptions, we show that this update rule converges with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the $\mathcal{NP}$-hard projection can be effectively executed on an adiabatic quantum annealer, harnessing recent advancements in quantum computation. We also introduce a projected version of this update rule and prove that if a fixed point exists in the binary variable space, the modified updates will converge to it. Last but not least, our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is on par with competitive and well-established baselines such as BinaryConnect, signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as well as binary graph neural networks.
♻ ☆ Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model ECCV 2024
In multi-class histopathology nuclei analysis tasks, the lack of training data becomes a main bottleneck for the performance of learning-based methods. To tackle this challenge, previous methods have utilized generative models to increase data by generating synthetic samples. However, existing methods often overlook the importance of considering the context of biological tissues (e.g., shape, spatial layout, and tissue type) in the synthetic data. Moreover, while generative models have shown superior performance in synthesizing realistic histopathology images, none of the existing methods are capable of producing image-label pairs at the same time. In this paper, we introduce a novel framework for co-synthesizing histopathology nuclei images and paired semantic labels using a context-conditioned joint diffusion model. We propose conditioning of a diffusion model using nucleus centroid layouts with structure-related text prompts to incorporate spatial and structural context information into the generation targets. Moreover, we enhance the granularity of our synthesized semantic labels by generating instance-wise nuclei labels using distance maps synthesized concurrently in conjunction with the images and semantic labels. We demonstrate the effectiveness of our framework in generating high-quality samples on multi-institutional, multi-organ, and multi-modality datasets. Our synthetic data consistently outperforms existing augmentation methods in the downstream tasks of nuclei segmentation and classification.
comment: ECCV 2024 accepted
Information Retrieval
☆ SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow
Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from the industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying. This demonstration presents SpannerLib a library for embedding document spanners in Python code. SpannerLib facilitates the development of IE programs by providing an implementation of Spannerlog (Datalog-based documentspanners) that interacts with the Python code in two directions: rules can be embedded inside Python, and they can invoke custom Python code (e.g., calls to ML-based NLP models) via user-defined functions. The demonstration scenarios showcase IE programs, with increasing levels of complexity, within Jupyter Notebook.
comment: 4 pages
☆ Laser: Parameter-Efficient LLM Bi-Tuning for Sequential Recommendation with Collaborative Information
Sequential recommender systems are essential for discerning user preferences from historical interactions and facilitating targeted recommendations. Recent innovations employing Large Language Models (LLMs) have advanced the field by encoding item semantics, yet they often necessitate substantial parameter tuning and are resource-demanding. Moreover, these works fails to consider the diverse characteristics of different types of users and thus diminishes the recommendation accuracy. In this paper, we propose a parameter-efficient Large Language Model Bi-Tuning framework for sequential recommendation with collaborative information (Laser). Specifically, Bi-Tuning works by inserting trainable virtual tokens at both the prefix and suffix of the input sequence and freezing the LLM parameters, thus optimizing the LLM for the sequential recommendation. In our Laser, the prefix is utilized to incorporate user-item collaborative information and adapt the LLM to the recommendation task, while the suffix converts the output embeddings of the LLM from the language space to the recommendation space for the follow-up item recommendation. Furthermore, to capture the characteristics of different types of users when integrating the collaborative information via the prefix, we introduce M-Former, a lightweight MoE-based querying transformer that uses a set of query experts to integrate diverse user-specific collaborative information encoded by frozen ID-based sequential recommender systems, significantly improving the accuracy of recommendations. Extensive experiments on real-world datasets demonstrate that Laser can parameter-efficiently adapt LLMs to effective recommender systems, significantly outperforming state-of-the-art methods.
comment: 11 pages, 4 figures
☆ Blockchain-based Federated Recommendation with Incentive Mechanism
Nowadays, federated recommendation technology is rapidly evolving to help multiple organisations share data and train models while meeting user privacy, data security and government regulatory requirements. However, federated recommendation increases customer system costs such as power, computational and communication resources. Besides, federated recommendation systems are also susceptible to model attacks and data poisoning by participating malicious clients. Therefore, most customers are unwilling to participate in federated recommendation without any incentive. To address these problems, we propose a blockchain-based federated recommendation system with incentive mechanism to promote more trustworthy, secure, and efficient federated recommendation service. First, we construct a federated recommendation system based on NeuMF and FedAvg. Then we introduce a reverse auction mechanism to select optimal clients that can maximize the social surplus. Finally, we employ blockchain for on-chain evidence storage of models to ensure the safety of the federated recommendation system. The experimental results show that our proposed incentive mechanism can attract clients with superior training data to engage in the federal recommendation at a lower cost, which can increase the economic benefit of federal recommendation by 54.9\% while improve the recommendation performance. Thus our work provides theoretical and technological support for the construction of a harmonious and healthy ecological environment for the application of federal recommendation.
comment: This paper has been accepted on 2024 Blockchain and Web3 Technology Innovation and Application Exchange Conference (BWTAC 2024)
♻ ☆ rerankers: A Lightweight Python Library to Unify Ranking Methods
This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches. Re-ranking is an integral component of many retrieval pipelines; however, there exist numerous approaches to it, relying on different implementation methods. rerankers unifies these methods into a single user-friendly interface, allowing practitioners and researchers alike to explore different methods while only changing a single line of Python code. Moreover ,rerankers ensures that its implementations are done with the fewest dependencies possible, and re-uses the original implementation whenever possible, guaranteeing that our simplified interface results in no performance degradation compared to more complex ones. The full source code and list of supported models are updated regularly and available at https://github.com/answerdotai/rerankers.
♻ ☆ Impedance vs. Power Side-channel Vulnerabilities: A Comparative Study
In recent times, impedance side-channel analysis has emerged as a potent strategy for adversaries seeking to extract sensitive information from computing systems. It leverages variations in the intrinsic impedance of a chip's internal structure across different logic states. In this study, we conduct a comparative analysis between the newly explored impedance side channel and the well-established power side channel. Through experimental evaluation, we investigate the efficacy of these two side channels in extracting the cryptographic key from the Advanced Encryption Standard (AES) and analyze their performance. Our results indicate that impedance analysis demonstrates a higher potential for cryptographic key extraction compared to power side-channel analysis. Moreover, we identify scenarios where power side-channel analysis does not yield satisfactory results, whereas impedance analysis proves to be more robust and effective. This work not only underscores the significance of impedance side-channel analysis in enhancing cryptographic security but also emphasizes the necessity for a deeper understanding of its mechanisms and implications.
Machine Learning
☆ Double Machine Learning at Scale to Predict Causal Impact of Customer Actions ECML
Causal Impact (CI) of customer actions are broadly used across the industry to inform both short- and long-term investment decisions of various types. In this paper, we apply the double machine learning (DML) methodology to estimate the CI values across 100s of customer actions of business interest and 100s of millions of customers. We operationalize DML through a causal ML library based on Spark with a flexible, JSON-driven model configuration approach to estimate CI at scale (i.e., across hundred of actions and millions of customers). We outline the DML methodology and implementation, and associated benefits over the traditional potential outcomes based CI model. We show population-level as well as customer-level CI values along with confidence intervals. The validation metrics show a 2.2% gain over the baseline methods and a 2.5X gain in the computational time. Our contribution is to advance the scalable application of CI, while also providing an interface that allows faster experimentation, cross-platform support, ability to onboard new use cases, and improves accessibility of underlying code for partner teams.
comment: 16 pages, 11 figures. Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2023, Turin, Italy
☆ Generative Principal Component Regression via Variational Inference
The ability to manipulate complex systems, such as the brain, to modify specific outcomes has far-reaching implications, particularly in the treatment of psychiatric disorders. One approach to designing appropriate manipulations is to target key features of predictive models. While generative latent variable models, such as probabilistic principal component analysis (PPCA), is a powerful tool for identifying targets, they struggle incorporating information relevant to low-variance outcomes into the latent space. When stimulation targets are designed on the latent space in such a scenario, the intervention can be suboptimal with minimal efficacy. To address this problem, we develop a novel objective based on supervised variational autoencoders (SVAEs) that enforces such information is represented in the latent space. The novel objective can be used with linear models, such as PPCA, which we refer to as generative principal component regression (gPCR). We show in simulations that gPCR dramatically improves target selection in manipulation as compared to standard PCR and SVAEs. As part of these simulations, we develop a metric for detecting when relevant information is not properly incorporated into the loadings. We then show in two neural datasets related to stress and social behavior in which gPCR dramatically outperforms PCR in predictive performance and that SVAEs exhibit low incorporation of relevant information into the loadings. Overall, this work suggests that our method significantly improves target selection for manipulation using latent variable models over competitor inference schemes.
☆ TimeDiT: General-purpose Diffusion Transformers for Time Series Foundation Model ICML 2024
With recent advances in building foundation models for texts and video data, there is a surge of interest in foundation models for time series. A family of models have been developed, utilizing a temporal auto-regressive generative Transformer architecture, whose effectiveness has been proven in Large Language Models. While the empirical results are promising, almost all existing time series foundation models have only been tested on well-curated ``benchmark'' datasets very similar to texts. However, real-world time series exhibit unique challenges, such as variable channel sizes across domains, missing values, and varying signal sampling intervals due to the multi-resolution nature of real-world data. Additionally, the uni-directional nature of temporally auto-regressive decoding limits the incorporation of domain knowledge, such as physical laws expressed as partial differential equations (PDEs). To address these challenges, we introduce the Time Diffusion Transformer (TimeDiT), a general foundation model for time series that employs a denoising diffusion paradigm instead of temporal auto-regressive generation. TimeDiT leverages the Transformer architecture to capture temporal dependencies and employs diffusion processes to generate high-quality candidate samples without imposing stringent assumptions on the target distribution via novel masking schemes and a channel alignment strategy. Furthermore, we propose a finetuning-free model editing strategy that allows the seamless integration of external knowledge during the sampling process without updating any model parameters. Extensive experiments conducted on a varity of tasks such as forecasting, imputation, and anomaly detection, demonstrate the effectiveness of TimeDiT.
comment: 23 Pages, 6 Figures, 11 Tables. First present at ICML 2024 Workshop on Foundation Models in the Wild
☆ On the Benefits of Memory for Modeling Time-Dependent PDEs
Data-driven techniques have emerged as a promising alternative to traditional numerical methods for solving partial differential equations (PDEs). These techniques frequently offer a better trade-off between computational cost and accuracy for many PDE families of interest. For time-dependent PDEs, existing methodologies typically treat PDEs as Markovian systems, i.e., the evolution of the system only depends on the ``current state'', and not the past states. However, distortion of the input signals -- e.g., due to discretization or low-pass filtering -- can render the evolution of the distorted signals non-Markovian. In this work, motivated by the Mori-Zwanzig theory of model reduction, we investigate the impact of architectures with memory for modeling PDEs: that is, when past states are explicitly used to predict the future. We introduce Memory Neural Operator (MemNO), a network based on the recent SSM architectures and Fourier Neural Operator (FNO). We empirically demonstrate on a variety of PDE families of interest that when the input is given on a low-resolution grid, MemNO significantly outperforms the baselines without memory, achieving more than 6 times less error on unseen PDEs. Via a combination of theory and experiments, we show that the effect of memory is particularly significant when the solution of the PDE has high frequency Fourier components (e.g., low-viscosity fluid dynamics), and it also increases robustness to observation noise.
☆ QID$^2$: An Image-Conditioned Diffusion Model for Q-space Up-sampling of DWI Data MICCAI 2024
We propose an image-conditioned diffusion model to estimate high angular resolution diffusion weighted imaging (DWI) from a low angular resolution acquisition. Our model, which we call QID$^2$, takes as input a set of low angular resolution DWI data and uses this information to estimate the DWI data associated with a target gradient direction. We leverage a U-Net architecture with cross-attention to preserve the positional information of the reference images, further guiding the target image generation. We train and evaluate QID$^2$ on single-shell DWI samples curated from the Human Connectome Project (HCP) dataset. Specifically, we sub-sample the HCP gradient directions to produce low angular resolution DWI data and train QID$^2$ to reconstruct the missing high angular resolution samples. We compare QID$^2$ with two state-of-the-art GAN models. Our results demonstrate that QID$^2$ not only achieves higher-quality generated images, but it consistently outperforms the GAN models in downstream tensor estimation across multiple metrics. Taken together, this study highlights the potential of diffusion models, and QID$^2$ in particular, for q-space up-sampling, thus offering a promising toolkit for clinical and research applications.
comment: Accepted at MICCAI 2024 International Workshop on Computational Diffusion MRI. Zijian Chen and Jueqi Wang contributed equally to this work
☆ A Lesion-aware Edge-based Graph Neural Network for Predicting Language Ability in Patients with Post-stroke Aphasia MICCAI 2024
We propose a lesion-aware graph neural network (LEGNet) to predict language ability from resting-state fMRI (rs-fMRI) connectivity in patients with post-stroke aphasia. Our model integrates three components: an edge-based learning module that encodes functional connectivity between brain regions, a lesion encoding module, and a subgraph learning module that leverages functional similarities for prediction. We use synthetic data derived from the Human Connectome Project (HCP) for hyperparameter tuning and model pretraining. We then evaluate the performance using repeated 10-fold cross-validation on an in-house neuroimaging dataset of post-stroke aphasia. Our results demonstrate that LEGNet outperforms baseline deep learning methods in predicting language ability. LEGNet also exhibits superior generalization ability when tested on a second in-house dataset that was acquired under a slightly different neuroimaging protocol. Taken together, the results of this study highlight the potential of LEGNet in effectively learning the relationships between rs-fMRI connectivity and language ability in a patient cohort with brain lesions for improved post-stroke aphasia evaluation.
comment: Accepted at MICCAI 2024 International Workshop on Machine Learning in Clinical Neuroimaging (MLCN)
☆ K-Origins: Better Colour Quantification for Neural Networks
K-Origins is a neural network layer designed to improve image-based network performances when learning colour, or intensities, is beneficial. Over 250 encoder-decoder convolutional networks are trained and tested on 16-bit synthetic data, demonstrating that K-Origins improves semantic segmentation accuracy in two scenarios: object detection with low signal-to-noise ratios, and segmenting multiple objects that are identical in shape but vary in colour. K-Origins generates output features from the input features, $\textbf{X}$, by the equation $\textbf{Y}_k = \textbf{X}-\textbf{J}\cdot w_k$ for each trainable parameter $w_k$, where $\textbf{J}$ is a matrix of ones. Additionally, networks with varying receptive fields were trained to determine optimal network depths based on the dimensions of target classes, suggesting that receptive field lengths should exceed object sizes. By ensuring a sufficient receptive field length and incorporating K-Origins, we can achieve better semantic network performance.
comment: 16 pages, 13 figures, 1 table
☆ Reinforcement Learning-enabled Satellite Constellation Reconfiguration and Retasking for Mission-Critical Applications
The development of satellite constellation applications is rapidly advancing due to increasing user demands, reduced operational costs, and technological advancements. However, a significant gap in the existing literature concerns reconfiguration and retasking issues within satellite constellations, which is the primary focus of our research. In this work, we critically assess the impact of satellite failures on constellation performance and the associated task requirements. To facilitate this analysis, we introduce a system modeling approach for GPS satellite constellations, enabling an investigation into performance dynamics and task distribution strategies, particularly in scenarios where satellite failures occur during mission-critical operations. Additionally, we introduce reinforcement learning (RL) techniques, specifically Q-learning, Policy Gradient, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO), for managing satellite constellations, addressing the challenges posed by reconfiguration and retasking following satellite failures. Our results demonstrate that DQN and PPO achieve effective outcomes in terms of average rewards, task completion rates, and response times.
comment: Accepted for publication in the IEEE Military Communications Conference (IEEE MILCOM 2024)
♻ ☆ PID Accelerated Temporal Difference Algorithms
Long-horizon tasks, which have a large discount factor, pose a challenge for most conventional reinforcement learning (RL) algorithms. Algorithms such as Value Iteration and Temporal Difference (TD) learning have a slow convergence rate and become inefficient in these tasks. When the transition distributions are given, PID VI was recently introduced to accelerate the convergence of Value Iteration using ideas from control theory. Inspired by this, we introduce PID TD Learning and PID Q-Learning algorithms for the RL setting, in which only samples from the environment are available. We give a theoretical analysis of the convergence of PID TD Learning and its acceleration compared to the conventional TD Learning. We also introduce a method for adapting PID gains in the presence of noise and empirically verify its effectiveness.
♻ ☆ Improving Rare Word Translation With Dictionaries and Attention Masking
In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.
comment: 11 pages, 3 figures, 3 tables. Accepted at AMTA 2024
♻ ☆ Low-Rank Quantization-Aware Training for LLMs
Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods, generally produce the best quantized performance, however it comes at the cost of potentially long training time and excessive memory usage, making it impractical when applying for LLMs. Inspired by parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices quantization granularity, activation quantization, and seamlessly combined with many PTQ techniques. We apply LR-QAT to LLaMA-1/2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at the fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer grade GPU with 24GB of memory. Our source code is available at https://github.com/qualcomm-ai-research/LR-QAT
♻ ☆ Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of Peptides
Molecular Dynamics (MD) simulations are irreplaceable and ubiquitous in fields of materials science, chemistry, pharmacology just to name a few. Conventional MD simulations are plagued by numerical stability as well as long equilibration time issues, which limits broader applications of MD simulations. Recently, a surge of deep learning approaches have been devised for time-coarsened dynamics, which learns the state transition mechanism over much larger time scales to overcome these limitations. However, only a few methods target the underlying Boltzmann distribution by resampling techniques, where proposals are rarely accepted as new states with low efficiency. In this work, we propose a force-guided bridge matching model, FBM, a novel framework that first incorporates physical priors into bridge matching for full-atom time-coarsened dynamics. With the guidance of our well-designed intermediate force field, FBM is feasible to target the Boltzmann-like distribution by direct inference without extra steps. Experiments on small peptides verify our superiority in terms of comprehensive metrics and demonstrate transferability to unseen peptide systems.
♻ ☆ Verifiable cloud-based variational quantum algorithms
Variational quantum algorithms (VQAs) have shown potential for quantum advantage with noisy intermediate-scale quantum (NISQ) devices for quantum machine learning (QML). However, given the high cost and limited availability of quantum resources, delegating VQAs via cloud networks is a more practical solution for clients with limited quantum capabilities. Recently, Shingu et al.[Physical Review A, 105, 022603 (2022)] proposed a variational secure cloud quantum computing protocol, utilizing ancilla-driven quantum computation (ADQC) for cloud-based VQAs with minimal quantum resource consumption. However, their protocol lacks verifiability, which exposes it to potential malicious behaviors by the server. Additionally, channel loss requires frequent re-delegation as the size of the delegated variational circuit grows, complicating verification due to increased circuit complexity. This paper introduces a new protocol to address these challenges and enhance both verifiability and tolerance to channel loss in cloud-based VQAs.
♻ ☆ Bayesian Learning in a Nonlinear Multiscale State-Space Model
The ubiquity of multiscale interactions in complex systems is well-recognized, with development and heredity serving as a prime example of how processes at different temporal scales influence one another. This work introduces a novel multiscale state-space model to explore the dynamic interplay between systems interacting across different time scales, with feedback between each scale. We propose a Bayesian learning framework to estimate unknown states by learning the unknown process noise covariances within this multiscale model. We develop a Particle Gibbs with Ancestor Sampling (PGAS) algorithm for inference and demonstrate through simulations the efficacy of our approach.
comment: Corrected a typo
♻ ☆ Foundation Models for Music: A Survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are underexplored in FM development. Then, emphasis is placed on the lack of versatility of previous methods on diverse music applications, along with the potential of FMs in music understanding, generation and medical application. By comprehensively exploring the details of the model pre-training paradigm, architectural choices, tokenisation, finetuning methodologies and controllability, we emphasise the important topics that should have been well explored, like instruction tuning and in-context learning, scaling law and emergent ability, as well as long-sequence modelling etc. A dedicated section presents insights into music agents, accompanied by a thorough analysis of datasets and evaluations essential for pre-training and downstream tasks. Finally, by underscoring the vital importance of ethical considerations, we advocate that following research on FM for music should focus more on such issues as interpretability, transparency, human responsibility, and copyright issues. The paper offers insights into future challenges and trends on FMs for music, aiming to shape the trajectory of human-AI collaboration in the music realm.
♻ ☆ Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS 2024
In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions that rely on keywords. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection features that rely on textual features and keywords are bypassed, an occurrence our observations show happens frequently.
comment: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)
♻ ☆ On the Convergence of Gradient Descent for Large Learning Rates
A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge starting from any initialization? We provide fundamental impossibility results showing that convergence becomes impossible no matter the initialization if the step size gets too big. Looking at the asymptotic value of the gradient norm along the optimization trajectory, we see that there is a phase transition as the step size crosses a critical value. This has been observed by practitioners, yet the true mechanisms through which this happens remain unclear beyond heuristics. Using results from dynamical systems theory, we provide a proof of this in the case of linear neural networks with a squared loss. We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient. We validate our findings through experiments with non-linear networks.
♻ ☆ On the Federated Learning Framework for Cooperative Perception
Cooperative perception is essential to enhance the efficiency and safety of future transportation systems, requiring extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling data privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for CP, termed the federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Utilizing the BEV transformer as the primary model, our rigorous testing on the OpenV2V dataset, augmented with FedBEVT data, demonstrates significant improvements in the average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of environmental perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.
comment: accepted by IEEE RA-L
♻ ☆ An embedding-based distance for temporal graphs
Temporal graphs are commonly used to represent time-resolved relations between entities in many natural and artificial systems. Many techniques were devised to investigate the evolution of temporal graphs by comparing their state at different time points. However, quantifying the similarity between temporal graphs as a whole is an open problem. Here, we use embeddings based on time-respecting random walks to introduce a new notion of distance between temporal graphs. This distance is well-defined for pairs of temporal graphs with different numbers of nodes and different time spans. We study the case of a matched pair of graphs, when a known relation exists between their nodes, and the case of unmatched graphs, when such a relation is unavailable and the graphs may be of different sizes. We use empirical and synthetic temporal network data to show that the distance we introduce discriminates graphs with different topological and temporal properties. We provide an efficient implementation of the distance computation suitable for large-scale temporal graphs.
♻ ☆ Heterogeneity-Informed Meta-Parameter Learning for Spatiotemporal Time Series Forecasting KDD'24
Spatiotemporal time series forecasting plays a key role in a wide range of real-world applications. While significant progress has been made in this area, fully capturing and leveraging spatiotemporal heterogeneity remains a fundamental challenge. Therefore, we propose a novel Heterogeneity-Informed Meta-Parameter Learning scheme. Specifically, our approach implicitly captures spatiotemporal heterogeneity through learning spatial and temporal embeddings, which can be viewed as a clustering process. Then, a novel spatiotemporal meta-parameter learning paradigm is proposed to learn spatiotemporal-specific parameters from meta-parameter pools, which is informed by the captured heterogeneity. Based on these ideas, we develop a Heterogeneity-Informed Spatiotemporal Meta-Network (HimNet) for spatiotemporal time series forecasting. Extensive experiments on five widely-used benchmarks demonstrate our method achieves state-of-the-art performance while exhibiting superior interpretability. Our code is available at https://github.com/XDZhelheim/HimNet.
comment: Published in KDD'24 Research Track
♻ ☆ FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability
We present FairX, an open-source Python-based benchmarking tool designed for the comprehensive analysis of models under the umbrella of fairness, utility, and eXplainability (XAI). FairX enables users to train benchmarking bias-mitigation models and evaluate their fairness using a wide array of fairness metrics, data utility metrics, and generate explanations for model predictions, all within a unified framework. Existing benchmarking tools do not have the way to evaluate synthetic data generated from fair generative models, also they do not have the support for training fair generative models either. In FairX, we add fair generative models in the collection of our fair-model library (pre-processing, in-processing, post-processing) and evaluation metrics for evaluating the quality of synthetic fair data. This version of FairX supports both tabular and image datasets. It also allows users to provide their own custom datasets. The open-source FairX benchmarking package is publicly available at \url{https://github.com/fahim-sikder/FairX}.
♻ ☆ Behavioral Learning of Dish Rinsing and Scrubbing based on Interruptive Direct Teaching Considering Assistance Rate
Robots are expected to manipulate objects in a safe and dexterous way. For example, washing dishes is a dexterous operation that involves scrubbing the dishes with a sponge and rinsing them with water. It is necessary to learn it safely without splashing water and without dropping the dishes. In this study, we propose a safe and dexterous manipulation system. The robot learns a dynamics model of the object by estimating the state of the object and the robot itself, the control input, and the amount of human assistance required (assistance rate) after the human corrects the initial trajectory of the robot's hands by interruptive direct teaching. By backpropagating the error between the estimated and the reference value using the acquired dynamics model, the robot can generate a control input that approaches the reference value, for example, so that human assistance is not required and the dish does not move excessively. This allows for adaptive rinsing and scrubbing of dishes with unknown shapes and properties. As a result, it is possible to generate safe actions that require less human assistance.
comment: Accepted at Advanced Robotics
♻ ☆ Towards Explainable Traffic Flow Prediction with Large Language Models
Traffic forecasting is crucial for intelligent transportation systems. It has experienced significant advancements thanks to the power of deep learning in capturing latent patterns of traffic data. However, recent deep-learning architectures require intricate model designs and lack an intuitive understanding of the mapping from input data to predicted results. Achieving both accuracy and explainability in traffic prediction models remains a challenge due to the complexity of traffic data and the inherent opacity of deep learning models. To tackle these challenges, we propose a Traffic flow Prediction model based on Large Language Models (LLMs) to generate explainable traffic predictions, named xTP-LLM. By transferring multi-modal traffic data into natural language descriptions, xTP-LLM captures complex time-series patterns and external factors from comprehensive traffic data. The LLM framework is fine-tuned using language-based instructions to align with spatial-temporal traffic flow data. Empirically, xTP-LLM shows competitive accuracy compared with deep learning baselines, while providing an intuitive and reliable explanation for predictions. This paper contributes to advancing explainable traffic prediction models and lays a foundation for future exploration of LLM applications in transportation. To the best of our knowledge, this is the first study to use LLM for explainable prediction of traffic flows.
comment: 31pages, 16 figures
♻ ☆ OceanGPT: A Large Language Model for Ocean Science Tasks ACL2024
Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet's surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reasons are the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever large language model in the ocean domain, which is expert in various ocean science tasks. We also propose OceanGPT, a novel framework to automatically obtain a large volume of ocean domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Though comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for oceans science tasks but also gains preliminary embodied intelligence capabilities in ocean technology.
comment: ACL2024. Project Website: http://oceangpt.zjukg.cn/
♻ ☆ Statistical Context Detection for Deep Lifelong Reinforcement Learning
Context detection involves labeling segments of an online stream of data as belonging to different tasks. Task labels are used in lifelong learning algorithms to perform consolidation or other procedures that prevent catastrophic forgetting. Inferring task labels from online experiences remains a challenging problem. Most approaches assume finite and low-dimension observation spaces or a preliminary training phase during which task labels are learned. Moreover, changes in the transition or reward functions can be detected only in combination with a policy, and therefore are more difficult to detect than changes in the input distribution. This paper presents an approach to learning both policies and labels in an online deep reinforcement learning setting. The key idea is to use distance metrics, obtained via optimal transport methods, i.e., Wasserstein distance, on suitable latent action-reward spaces to measure distances between sets of data points from past and current streams. Such distances can then be used for statistical tests based on an adapted Kolmogorov-Smirnov calculation to assign labels to sequences of experiences. A rollback procedure is introduced to learn multiple policies by ensuring that only the appropriate data is used to train the corresponding policy. The combination of task detection and policy deployment allows for the optimization of lifelong reinforcement learning agents without an oracle that provides task labels. The approach is tested using two benchmarks and the results show promising performance when compared with related context detection algorithms. The results suggest that optimal transport statistical methods provide an explainable and justifiable procedure for online context detection and reward optimization in lifelong reinforcement learning.
comment: 10 pages excluding references and bibliography. Accepted at CoLLAs 2024
♻ ☆ Enhancing Cell Tracking with a Time-Symmetric Deep Learning Approach
The accurate tracking of live cells using video microscopy recordings remains a challenging task for popular state-of-the-art image processing based object tracking methods. In recent years, several existing and new applications have attempted to integrate deep-learning based frameworks for this task, but most of them still heavily rely on consecutive frame based tracking embedded in their architecture or other premises that hinder generalized learning. To address this issue, we aimed to develop a new deep-learning based tracking method that relies solely on the assumption that cells can be tracked based on their spatio-temporal neighborhood, without restricting it to consecutive frames. The proposed method has the additional benefit that the motion patterns of the cells can be learned completely by the predictor without any prior assumptions, and it has the potential to handle a large number of video frames with heavy artifacts. The efficacy of the proposed method is demonstrated through biologically motivated validation strategies and compared against multiple state-of-the-art cell tracking methods.
♻ ☆ Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction
Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this paper, we propose a novel post-hoc interpretability algorithm for cancer drug response prediction, CETExplainer, which incorporates a controllable edge-type-specific weighting mechanism. It considers the mutual information between subgraphs and predictions, proposing a structural scoring approach to provide fine-grained, biologically meaningful explanations for predictive models. We also introduce a method for constructing ground truth based on real-world datasets to quantitatively evaluate the proposed interpretability algorithm. Empirical analysis on the real-world dataset demonstrates that CETExplainer achieves superior stability and improves explanation quality compared to leading algorithms, thereby offering a robust and insightful tool for cancer drug prediction.
♻ ☆ GANs Conditioning Methods: A Survey
In recent years, Generative Adversarial Networks (GANs) have seen significant advancements, leading to their widespread adoption across various fields. The original GAN architecture enables the generation of images without any specific control over the content, making it an unconditional generation process. However, many practical applications require precise control over the generated output, which has led to the development of conditional GANs (cGANs) that incorporate explicit conditioning to guide the generation process. cGANs extend the original framework by incorporating additional information (conditions), enabling the generation of samples that adhere to that specific criteria. Various conditioning methods have been proposed, each differing in how they integrate the conditioning information into both the generator and the discriminator networks. In this work, we review the conditioning methods proposed for GANs, exploring the characteristics of each method and highlighting their unique mechanisms and theoretical foundations. Furthermore, we conduct a comparative analysis of these methods, evaluating their performance on various image datasets. Through these analyses, we aim to provide insights into the strengths and limitations of various conditioning techniques, guiding future research and application in generative modeling.
♻ ☆ A Survey on Stability of Learning with Limited Labelled Data and its Sensitivity to the Effects of Randomness
Learning with limited labelled data, such as prompting, in-context learning, fine-tuning, meta-learning or few-shot learning, aims to effectively train a model using only a small amount of labelled samples. However, these approaches have been observed to be excessively sensitive to the effects of uncontrolled randomness caused by non-determinism in the training process. The randomness negatively affects the stability of the models, leading to large variances in results across training runs. When such sensitivity is disregarded, it can unintentionally, but unfortunately also intentionally, create an imaginary perception of research progress. Recently, this area started to attract research attention and the number of relevant studies is continuously growing. In this survey, we provide a comprehensive overview of 415 papers addressing the effects of randomness on the stability of learning with limited labelled data. We distinguish between four main tasks addressed in the papers (investigate/evaluate; determine; mitigate; benchmark/compare/report randomness effects), providing findings for each one. Furthermore, we identify and discuss seven challenges and open problems together with possible directions to facilitate further research. The ultimate goal of this survey is to emphasise the importance of this growing research area, which so far has not received an appropriate level of attention, and reveal impactful directions for future research.
comment: Accepted to ACM Comput. Surv. 2024
♻ ☆ Efficient Heterogeneous Graph Learning via Random Projection
Heterogeneous Graph Neural Networks (HGNNs) are powerful tools for deep learning on heterogeneous graphs. Typical HGNNs require repetitive message passing during training, limiting efficiency for large-scale real-world graphs. Recent pre-computation-based HGNNs use one-time message passing to transform a heterogeneous graph into regular-shaped tensors, enabling efficient mini-batch training. Existing pre-computation-based HGNNs can be mainly categorized into two styles, which differ in how much information loss is allowed and efficiency. We propose a hybrid pre-computation-based HGNN, named Random Projection Heterogeneous Graph Neural Network (RpHGNN), which combines the benefits of one style's efficiency with the low information loss of the other style. To achieve efficiency, the main framework of RpHGNN consists of propagate-then-update iterations, where we introduce a Random Projection Squashing step to ensure that complexity increases only linearly. To achieve low information loss, we introduce a Relation-wise Neighbor Collection component with an Even-odd Propagation Scheme, which aims to collect information from neighbors in a finer-grained way. Experimental results indicate that our approach achieves state-of-the-art results on seven small and large benchmark datasets while also being 230% faster compared to the most effective baseline. Surprisingly, our approach not only surpasses pre-processing-based baselines but also outperforms end-to-end methods.
comment: Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE)
♻ ☆ KTO: Model Alignment as Prospect Theoretic Optimization ICML 2024
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
comment: ICML 2024
♻ ☆ End-to-end Feature Selection Approach for Learning Skinny Trees AISTATS 2024
We propose a new optimization-based approach for feature selection in tree ensembles, an important problem in statistics and machine learning. Popular tree ensemble toolkits e.g., Gradient Boosted Trees and Random Forests support feature selection post-training based on feature importance scores, while very popular, they are known to have drawbacks. We propose Skinny Trees: an end-to-end toolkit for feature selection in tree ensembles where we train a tree ensemble while controlling the number of selected features. Our optimization-based approach learns an ensemble of differentiable trees, and simultaneously performs feature selection using a grouped $\ell_0$-regularizer. We use first-order methods for optimization and present convergence guarantees for our approach. We use a dense-to-sparse regularization scheduling scheme that can lead to more expressive and sparser tree ensembles. On 15 synthetic and real-world datasets, Skinny Trees can achieve $1.5\!\times\! -~620~\!\times\!$ feature compression rates, leading up to $10\times$ faster inference over dense trees, without any loss in performance. Skinny Trees lead to superior feature selection than many existing toolkits e.g., in terms of AUC performance for 25\% feature budget, Skinny Trees outperforms LightGBM by $10.2\%$ (up to $37.7\%$), and Random Forests by $3\%$ (up to $12.5\%$).
comment: Accepted in AISTATS 2024
♻ ☆ Fair Mixed Effects Support Vector Machine
To ensure unbiased and ethical automated predictions, fairness must be a core principle in machine learning applications. Fairness in machine learning aims to mitigate biases present in the training data and model imperfections that could lead to discriminatory outcomes. This is achieved by preventing the model from making decisions based on sensitive characteristics like ethnicity or sexual orientation. A fundamental assumption in machine learning is the independence of observations. However, this assumption often does not hold true for data describing social phenomena, where data points are often clustered based. Hence, if the machine learning models do not account for the cluster correlations, the results may be biased. Especially high is the bias in cases where the cluster assignment is correlated to the variable of interest. We present a fair mixed effects support vector machine algorithm that can handle both problems simultaneously. With a reproducible simulation study we demonstrate the impact of clustered data on the quality of fair machine learning predictions.
comment: 14 pages, 6 figures
♻ ☆ White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
comment: Accepted at Journal of Machine Learning Research. This paper integrates the works arXiv:2306.01129 and arXiv:2308.16271 into a complete story. In this paper, we improve the writing and organization, and also add conceptual, empirical, and theoretical improvements over the previous work. V2: small typo fixes and formatting improvements. V3: improvements from journal revisions
♻ ☆ Contrastive Learning and Abstract Concepts: The Case of Natural Numbers
Contrastive Learning (CL) has been successfully applied to classification and other downstream tasks related to concrete concepts, such as objects contained in the ImageNet dataset. No attempts seem to have been made so far in applying this promising scheme to more abstract entities. A prominent example of these could be the concept of (discrete) Quantity. CL can be frequently interpreted as a self-supervised scheme guided by some profound and ubiquitous conservation principle (e.g. conservation of identity in object classification tasks). In this introductory work we apply a suitable conservation principle to the semi-abstract concept of natural numbers by which discrete quantities can be estimated or predicted. We experimentally show, by means of a toy problem, that contrastive learning can be trained to count at a glance with high accuracy both at human as well as at super-human ranges.. We compare this with the results of a trained-to-count at a glance supervised learning (SL) neural network scheme of similar architecture. We show that both schemes exhibit similar good performance on baseline experiments, where the distributions of the training and testing stages are equal. Importantly, we demonstrate that in some generalization scenarios, where training and testing distributions differ, CL boasts more robust and much better error performance.
♻ ☆ NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to a thousand GPUs for training the largest open-source LLMs such as Nemotron 4 340B and Llama 3.1 405B. NeMo-Aligner comes with highly optimized and scalable implementations for major paradigms of model alignment such as: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally, our toolkit supports running most of the alignment techniques in a Parameter Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for extensibility, allowing support for other alignment techniques with minimal effort. It is open-sourced with Apache 2.0 License and we invite community contributions at https://github.com/NVIDIA/NeMo-Aligner
comment: 16 pages, 4 figures, Accepted to COLM 2024
♻ ☆ On the Optimality of Misspecified Spectral Algorithms
In the misspecified spectral algorithms problem, researchers usually assume the underground true function $f_{\rho}^{*} \in [\mathcal{H}]^{s}$, a less-smooth interpolation space of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ for some $s\in (0,1)$. The existing minimax optimal results require $\|f_{\rho}^{*}\|_{L^{\infty}}<\infty$ which implicitly requires $s > \alpha_{0}$ where $\alpha_{0}\in (0,1)$ is the embedding index, a constant depending on $\mathcal{H}$. Whether the spectral algorithms are optimal for all $s\in (0,1)$ is an outstanding problem lasting for years. In this paper, we show that spectral algorithms are minimax optimal for any $\alpha_{0}-\frac{1}{\beta} < s < 1$, where $\beta$ is the eigenvalue decay rate of $\mathcal{H}$. We also give several classes of RKHSs whose embedding index satisfies $ \alpha_0 = \frac{1}{\beta} $. Thus, the spectral algorithms are minimax optimal for all $s\in (0,1)$ on these RKHSs.
comment: 50 pages, 2 figures
♻ ☆ Safety Constrained Multi-Agent Reinforcement Learning for Active Voltage Control IJCAI2024
Active voltage control presents a promising avenue for relieving power congestion and enhancing voltage quality, taking advantage of the distributed controllable generators in the power network, such as roof-top photovoltaics. While Multi-Agent Reinforcement Learning (MARL) has emerged as a compelling approach to address this challenge, existing MARL approaches tend to overlook the constrained optimization nature of this problem, failing in guaranteeing safety constraints. In this paper, we formalize the active voltage control problem as a constrained Markov game and propose a safety-constrained MARL algorithm. We expand the primal-dual optimization RL method to multi-agent settings, and augment it with a novel approach of double safety estimation to learn the policy and to update the Lagrange-multiplier. In addition, we proposed different cost functions and investigated their influences on the behavior of our constrained MARL method. We evaluate our approach in the power distribution network simulation environment with real-world scale scenarios. Experimental results demonstrate the effectiveness of the proposed method compared with the state-of-the-art MARL methods. This paper is published at \url{https://www.ijcai.org/Proceedings/2024/}.
comment: Accepted by IJCAI2024
♻ ☆ Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers
Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data on scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach outperforms prior arts consistently on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.
♻ ☆ TimeSeriesBench: An Industrial-Grade Benchmark for Time Series Anomaly Detection Models
Time series anomaly detection (TSAD) has gained significant attention due to its real-world applications to improve the stability of modern software systems. However, there is no effective way to verify whether they can meet the requirements for real-world deployment. Firstly, current algorithms typically train a specific model for each time series. Maintaining such many models is impractical in a large-scale system with tens of thousands of curves. The performance of using merely one unified model to detect anomalies remains unknown. Secondly, most TSAD models are trained on the historical part of a time series and are tested on its future segment. In distributed systems, however, there are frequent system deployments and upgrades, with new, previously unseen time series emerging daily. The performance of testing newly incoming unseen time series on current TSAD algorithms remains unknown. Lastly, the assumptions of the evaluation metrics in existing benchmarks are far from practical demands. To solve the above-mentioned problems, we propose an industrial-grade benchmark TimeSeriesBench. We assess the performance of existing algorithms across more than 168 evaluation settings and provide comprehensive analysis for the future design of anomaly detection algorithms. An industrial dataset is also released along with TimeSeriesBench.
comment: Accepted by ISSRE'24
♻ ☆ Investigating Recurrent Transformers with Dynamic Halt
In this paper, we comprehensively study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism: (1) the approach of incorporating a depth-wise recurrence similar to Universal Transformers; and (2) the approach of incorporating a chunk-wise temporal recurrence like Temporal Latent Bottleneck. Furthermore, we propose and investigate novel ways to extend and combine the above methods - for example, we propose a global mean-based dynamic halting mechanism for Universal Transformers and an augmentation of Temporal Latent Bottleneck with elements from Universal Transformer. We compare the models and probe their inductive biases in several diagnostic tasks, such as Long Range Arena (LRA), flip-flop language modeling, ListOps, and Logical Inference. The code is released in: https://github.com/JRC1995/InvestigatingRecurrentTransformers/tree/main
♻ ☆ OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. Language model systems often enable LLMs to generate code for arithmetic operations to achieve accurate calculations. However, this approach compromises speed and security, and fine-tuning risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in a single autoregressive step, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of a LLM to control a symbolic architecture that performs arithmetic. Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100\% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o with and without a code interpreter. Furthermore, OccamLlama outperforms GPT 4o with and without a code interpreter on average across a range of mathematical problem solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models. We will make our code public shortly.
♻ ☆ MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning
Determining the optimal size of a neural network is critical, as it directly impacts runtime performance and memory usage. Pruning is a well-established model compression technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation. However, many recent pruning methods overlook the global contributions of individual model components, making it difficult to ensure that a pruned model meets the desired dataset and performance requirements. To address these challenges, we developed a new pruning algorithm, MPruner, that leverages mutual information through vector similarity. MPruner utilizes layer clustering with the Centered Kernel Alignment (CKA) similarity metric, allowing us to incorporate global information from the neural network for more precise and efficient layer-wise pruning. We evaluated MPruner across various architectures and configurations, demonstrating its versatility and providing practical guidelines. MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
♻ ☆ Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions
Learning-based control has recently shown great efficacy in performing complex tasks for various applications. However, to deploy it in real systems, it is of vital importance to guarantee the system will stay safe. Control Barrier Functions (CBFs) offer mathematical tools for designing safety-preserving controllers for systems with known dynamics. In this article, we first introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers using Gaussian Process (GP) regression to close the gap between an approximate mathematical model and the real system, which results in a second-order cone program (SOCP)-based control design. We then present the pointwise feasibility conditions of the resulting safety controller, highlighting the level of richness that the available system information must meet to ensure safety. We use these conditions to devise an event-triggered online data collection strategy that ensures the recursive feasibility of the learned safety controller. Our method works by constantly reasoning about whether the current information is sufficient to ensure safety or if new measurements under active safe exploration are required to reduce the uncertainty. As a result, our proposed framework can guarantee the forward invariance of the safe set defined by the CBF with high probability, even if it contains a priori unexplored regions. We validate the proposed framework in two numerical simulation experiments.
comment: Journal article. Includes the results of the 2021 CDC paper titled "Pointwise feasibility of gaussian process-based safety-critical control under model uncertainty" and proposes a recursively feasible safe online learning algorithm as new contribution
♻ ☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
♻ ☆ Kolmogorov Arnold Networks in Fraud Detection: Bridging the Gap Between Theory and Practice
This study evaluates the applicability of Kolmogorov-Arnold Networks (KAN) in fraud detection, finding that their effectiveness is context-dependent. We propose a quick decision rule using Principal Component Analysis (PCA) to assess the suitability of KAN: if data can be effectively separated in two dimensions using splines, KAN may outperform traditional models; otherwise, other methods could be more appropriate. We also introduce a heuristic approach to hyperparameter tuning, significantly reducing computational costs. These findings suggest that while KAN has potential, its use should be guided by data-specific assessments.
♻ ☆ Deep-MacroFin: Informed Equilibrium Neural Network for Continuous Time Economic Models
In this paper, we present Deep-MacroFin, a comprehensive framework designed to solve partial differential equations, with a particular focus on models in continuous time economics. This framework leverages deep learning methodologies, including conventional Multi-Layer Perceptrons and the newly developed Kolmogorov-Arnold Networks. It is optimized using economic information encapsulated by Hamilton-Jacobi-Bellman equations and coupled algebraic equations. The application of neural networks holds the promise of accurately resolving high-dimensional problems with fewer computational demands and limitations compared to standard numerical methods. This versatile framework can be readily adapted for elementary differential equations, and systems of differential equations, even in cases where the solutions may exhibit discontinuities. Importantly, it offers a more straightforward and user-friendly implementation than existing libraries.
comment: 25 pages, 8 figures
♻ ☆ What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims
Background: Collaborative Software Package Registries (SPRs) are an integral part of the software supply chain. Much engineering work synthesizes SPR package into applications. Prior research has examined SPRs for traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. Aims: Recent empirical research has examined PTM registries in ways such as vulnerabilities, reuse processes, and evolution. However, no existing research synthesizes them to provide a systematic understanding of the current knowledge. Some of the existing research includes qualitative claims lacking quantitative analysis. Our research fills these gaps by providing a knowledge synthesis and quantitative analyses. Methods: We first conduct a systematic literature review (SLR). We then observe that some of the claims are qualitative. We identify quantifiable metrics associated with those claims, and measure in order to substantiate these claims. Results: From our SLR, we identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative validation. We successfully test 3 of these claims through a quantitative analysis, and directly compare one with traditional software. Our findings corroborate qualitative claims with quantitative measurements. Our findings are: (1) PTMs have a much higher turnover rate than traditional software, indicating a dynamic and rapidly evolving reuse environment within the PTM ecosystem; and (2) There is a strong correlation between documentation quality and PTM popularity. Conclusions: We confirm qualitative research claims with concrete metrics, supporting prior qualitative and case study research. Our measures show further dynamics of PTM reuse, inspiring research infrastructure and new measures.
comment: [ESEM'24] Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) 2024
♻ ☆ Linear Contextual Bandits with Hybrid Payoff: Revisited ECML
We study the Linear Contextual Bandit problem in the hybrid reward setting. In this setting every arm's reward model contains arm specific parameters in addition to parameters shared across the reward models of all the arms. We can reduce this setting to two closely related settings (a) Shared - no arm specific parameters, and (b) Disjoint - only arm specific parameters, enabling the application of two popular state of the art algorithms - $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ (Algorithm 1 in (Li et al. 2010)). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms, significantly improving on the known regret guarantees of these algorithms. Our novel analysis critically exploits the hybrid reward structure and the diversity condition. Moreover, we introduce a new algorithm $\texttt{HyLinUCB}$ that crucially modifies $\texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that $\texttt{HyLinUCB}$ also incurs only $O(\sqrt{T})$ regret for $T$ rounds. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of $\texttt{HyLinUCB}$.For number of arm specific parameters much larger than the number of shared parameters, we observe that $\texttt{DisLinUCB}$ incurs the lowest regret. In this case, regret of $\texttt{HyLinUCB}$ is the second best and extremely competitive to $\texttt{DisLinUCB}$. In all other situations, including our real-world dataset, $\texttt{HyLinUCB}$ has significantly lower regret than $\texttt{LinUCB}$, $\texttt{DisLinUCB}$ and other SOTA baselines we considered. We also empirically observe that the regret of $\texttt{HyLinUCB}$ grows much slower with the number of arms compared to baselines, making it suitable even for very large action spaces.
comment: Accepted at ECML PKDD 2024 as a Research Track Paper
♻ ☆ Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification
The lack of a suitable tool for the analysis of conversational texts in the Persian language has made various analyses of these texts, including Sentiment Analysis, difficult. In this research, we tried to make the understanding of these texts easier for the machine by providing PSC, Persian Slang Converter, a tool for converting conversational texts into formal ones, and by using the most up-to-date and best deep learning methods along with the PSC, the sentiment learning of short Persian language texts for the machine in a better way. be made More than 10 million unlabeled texts from various social networks and movie subtitles (as Conversational texts) and about 10 million news texts (as formal texts) have been used for training unsupervised models and formal implementation of the tool. 60,000 texts from the comments of Instagram social network users with positive, negative, and neutral labels are considered supervised data for training the emotion classification model of short texts. Using the formal tool, 57% of the words of the corpus of conversation were converted. Finally, by using the formalizer, FastText model, and deep LSTM network, an accuracy of 81.91 was obtained on the test data.
comment: 16 pages, 4 figures, 14 tables
♻ ☆ Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients
We present, QP-SBGD, a novel layer-wise stochastic optimiser tailored towards training neural networks with binary weights, known as binary neural networks (BNNs), on quantum hardware. BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy. However, training them in practice remains to be an open challenge. Most known BNN-optimisers either rely on projected updates or binarise weights post-training. Instead, QP-SBGD approximately maps the gradient onto binary variables, by solving a quadratic constrained binary optimisation. Under practically reasonable assumptions, we show that this update rule converges with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the $\mathcal{NP}$-hard projection can be effectively executed on an adiabatic quantum annealer, harnessing recent advancements in quantum computation. We also introduce a projected version of this update rule and prove that if a fixed point exists in the binary variable space, the modified updates will converge to it. Last but not least, our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is on par with competitive and well-established baselines such as BinaryConnect, signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as well as binary graph neural networks.
Multimedia
☆ LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement
In this paper, we propose long short term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates visual and audio features, then processes them through a separator network for optimized speech enhancement. The architecture highlights advancements in leveraging multi-modal data and interpolation techniques for robust AVSE challenge systems. The performance of LSTMSE-Net surpasses that of the baseline model from the COG-MHEAR AVSE Challenge 2024 by a margin of 0.06 in scale-invariant signal-to-distortion ratio (SISDR), $0.03$ in short-time objective intelligibility (STOI), and $1.32$ in perceptual evaluation of speech quality (PESQ). The source code of the proposed LSTMSE-Net is available at \url{https://github.com/mtanveer1/AVSEC-3-Challenge}.
☆ Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection, Removal, and Generation in the Era of Deep Learning
Shadows are formed when light encounters obstacles, leading to areas of diminished illumination. In computer vision, shadow detection, removal, and generation are crucial for enhancing scene understanding, refining image quality, ensuring visual consistency in video editing, and improving virtual environments. This paper presents a comprehensive survey of shadow detection, removal, and generation in images and videos within the deep learning landscape over the past decade, covering tasks, deep models, datasets, and evaluation metrics. Our key contributions include a comprehensive survey of shadow analysis, standardization of experimental comparisons, exploration of the relationships among model size, speed, and performance, a cross-dataset generalization study, identification of open issues and future directions, and provision of publicly available resources to support further research.
comment: Publicly available results, trained models, and evaluation metrics at https://github.com/xw-hu/Unveiling-Deep-Shadows
☆ Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models ECCV 2024
This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework employing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach involves assessing image clearness and providing semantics using vision-language models on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real-world data, utilizing a dual-step strategy with pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art works.
comment: Accepted by ECCV 2024
☆ Low-Resolution Face Recognition via Adaptable Instance-Relation Distillation IJCNN 2024
Low-resolution face recognition is a challenging task due to the missing of informative details. Recent approaches based on knowledge distillation have proven that high-resolution clues can well guide low-resolution face recognition via proper knowledge transfer. However, due to the distribution difference between training and testing faces, the learned models often suffer from poor adaptability. To address that, we split the knowledge transfer process into distillation and adaptation steps, and propose an adaptable instance-relation distillation approach to facilitate low-resolution face recognition. In the approach, the student distills knowledge from high-resolution teacher in both instance level and relation level, providing sufficient cross-resolution knowledge transfer. Then, the learned student can be adaptable to recognize low-resolution faces with adaptive batch normalization in inference. In this manner, the capability of recovering missing details of familiar low-resolution faces can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
comment: Accepted by IJCNN 2024
☆ PRoGS: Progressive Rendering of Gaussian Splats
Over the past year, 3D Gaussian Splatting (3DGS) has received significant attention for its ability to represent 3D scenes in a perceptually accurate manner. However, it can require a substantial amount of storage since each splat's individual data must be stored. While compression techniques offer a potential solution by reducing the memory footprint, they still necessitate retrieving the entire scene before any part of it can be rendered. In this work, we introduce a novel approach for progressively rendering such scenes, aiming to display visible content that closely approximates the final scene as early as possible without loading the entire scene into memory. This approach benefits both on-device rendering applications limited by memory constraints and streaming applications where minimal bandwidth usage is preferred. To achieve this, we approximate the contribution of each Gaussian to the final scene and construct an order of prioritization on their inclusion in the rendering process. Additionally, we demonstrate that our approach can be combined with existing compression methods to progressively render (and stream) 3DGS scenes, optimizing bandwidth usage by focusing on the most important splats within a scene. Overall, our work establishes a foundation for making remotely hosted 3DGS content more quickly accessible to end-users in over-the-top consumption scenarios, with our results showing significant improvements in quality across all metrics compared to existing methods.
☆ Privacy-Preserving Multimedia Mobile Cloud Computing Using Protective Perturbation
Mobile cloud computing has been adopted in many multimedia applications, where the resource-constrained mobile device sends multimedia data (e.g., images) to remote cloud servers to request computation-intensive multimedia services (e.g., image recognition). While significantly improving the performance of the mobile applications, the cloud-based mechanism often causes privacy concerns as the multimedia data and services are offloaded from the trusted user device to untrusted cloud servers. Several recent studies have proposed perturbation-based privacy preserving mechanisms, which obfuscate the offloaded multimedia data to eliminate privacy exposures without affecting the functionality of the remote multimedia services. However, the existing privacy protection approaches require the deployment of computation-intensive perturbation generation on the resource-constrained mobile devices. Also, the obfuscated images are typically not compliant with the standard image compression algorithms and suffer from significant bandwidth consumption. In this paper, we develop a novel privacy-preserving multimedia mobile cloud computing framework, namely $PMC^2$, to address the resource and bandwidth challenges. $PMC^2$ employs secure confidential computing in the cloud to deploy the perturbation generator, which addresses the resource challenge while maintaining the privacy. Furthermore, we develop a neural compressor specifically trained to compress the perturbed images in order to address the bandwidth challenge. We implement $PMC^2$ in an end-to-end mobile cloud computing system, based on which our evaluations demonstrate superior latency, power efficiency, and bandwidth consumption achieved by $PMC^2$ while maintaining high accuracy in the target multimedia service.
☆ Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition
We propose a new strategy called think twice before recognizing to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to the complex road conditions, and existing approaches particularly struggle with cross-country TSR when data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMM). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions with center coordinate prompt optimization help the LMM to locate the target traffic sign in the original road images containing multiple traffic signs and filter irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning of template traffic signs, which decreases the cross-domain difference and enhances the fine-grained recognition capability of the LMM. The differential descriptions of similar traffic signs optimize the multimodal thinking capability of the LMM. The proposed method is independent of training data and requires only simple and uniform instructions. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries, and the proposed method achieves state-of-the-art TSR results on all five datasets.
♻ ☆ TALDS-Net: Task-Aware Adaptive Local Descriptors Selection for Few-shot Image Classification ICASSP 2024
Few-shot image classification aims to classify images from unseen novel classes with few samples. Recent works demonstrate that deep local descriptors exhibit enhanced representational capabilities compared to image-level features. However, most existing methods solely rely on either employing all local descriptors or directly utilizing partial descriptors, potentially resulting in the loss of crucial information. Moreover, these methods primarily emphasize the selection of query descriptors while overlooking support descriptors. In this paper, we propose a novel Task-Aware Adaptive Local Descriptors Selection Network (TALDS-Net), which exhibits the capacity for adaptive selection of task-aware support descriptors and query descriptors. Specifically, we compare the similarity of each local support descriptor with other local support descriptors to obtain the optimal support descriptor subset and then compare the query descriptors with the optimal support subset to obtain discriminative query descriptors. Extensive experiments demonstrate that our TALDS-Net outperforms state-of-the-art methods on both general and fine-grained datasets.
comment: 4 pages, 1 figures, is accepted by ICASSP 2024
♻ ☆ IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection
Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from $10$ U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.
comment: 40 pages
Computation and Language
☆ DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models
As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce {DiversityMedQA}, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. Furthermore, to ensure the perturbations were accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
☆ The Compressor-Retriever Architecture for Language Model OS
Recent advancements in large language models (LLMs) have significantly enhanced their capacity to aggregate and process information across multiple modalities, enabling them to perform a wide range of tasks such as multimodal data querying, tool usage, web interactions, and handling long documents. These capabilities pave the way for transforming LLMs from mere chatbots into general-purpose agents capable of interacting with the real world. This paper explores the concept of using a language model as the core component of an operating system (OS), effectively acting as a CPU that processes data stored in a context window, which functions as RAM. A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions, a feature limited by the current session-based interaction paradigm due to context window size limit. To address this, we introduce compressor-retriever, a model-agnostic architecture designed for life-long context management. Unlike other long-context solutions such as retrieval-augmented generation, our approach exclusively uses the base model's forward function to compress and retrieve context, ensuring end-to-end differentiability. Preliminary experiments demonstrate the effectiveness of this architecture in in-context learning tasks, marking a step towards the development of a fully stateful LLM OS. Project repo available at: https://github.com/gblackout/LM-OS
☆ Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling. These models use conditionally activated feedforward subnetworks in transformer blocks, allowing for a separation between total model parameters and per-example computation. However, large token-routed SMoE models face a significant challenge: during inference, the entire model must be used for a sequence or a batch, resulting in high latencies in a distributed setting that offsets the advantages of per-token sparse activation. Our research explores task-specific model pruning to inform decisions about designing SMoE architectures, mainly modulating the choice of expert counts in pretraining. We investigate whether such pruned models offer advantages over smaller SMoE models trained from scratch, when evaluating and comparing them individually on tasks. To that end, we introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training. Our findings reveal a threshold pruning factor for the reduction that depends on the number of experts used in pretraining, above which, the reduction starts to degrade model performance. These insights contribute to our understanding of model design choices when pretraining with SMoE architectures, particularly useful when considering task-specific inference optimization for later stages.
☆ Masked Mixers for Language Generation and Retrieval
Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit there to be downside to the use of attention: most information present in the input is necessarily lost. In support of this idea we observe poor input representation accuracy in transformers, but find more accurate representation in what we term masked mixers which replace self-attention with masked convolutions. Applied to TinyStories the masked mixer learns causal language tasks more efficiently than early transformer implementations and somewhat less efficiently than optimized, current implementations. The most efficient learning algorithm observed for this dataset is a transformer-masked mixer hybrid, suggesting that these models learn in an orthogonal manner. We hypothesized that the information loss exhibited by transformers would be much more detrimental to retrieval than generation, and to test this we introduce an efficient training approach for retrieval models based on existing generative model embeddings. With this method, embeddings from masked mixers are found to result in far better summary-to-story retrieval compared to embeddings from transformers.
comment: 23 pages, 15 figures (11 primary, 4 supplementary)
☆ PoliPrompt: A High-Performance Cost-Effective LLM-Based Text Classification Framework for Political Science
Recent advancements in large language models (LLMs) have opened new avenues for enhancing text classification efficiency in political science, surpassing traditional machine learning methods that often require extensive feature engineering, human labeling, and task-specific training. However, their effectiveness in achieving high classification accuracy remains questionable. This paper introduces a three-stage in-context learning approach that leverages LLMs to improve classification accuracy while minimizing experimental costs. Our method incorporates automatic enhanced prompt generation, adaptive exemplar selection, and a consensus mechanism that resolves discrepancies between two weaker LLMs, refined by an advanced LLM. We validate our approach using datasets from the BBC news reports, Kavanaugh Supreme Court confirmation, and 2018 election campaign ads. The results show significant improvements in classification F1 score (+0.36 for zero-shot classification) with manageable economic costs (-78% compared with human labeling), demonstrating that our method effectively addresses the limitations of traditional machine learning while offering a scalable and reliable solution for text analysis in political science.
comment: 23 pages, 5 figures
☆ Efficient and Scalable Estimation of Tool Representations in Vector Space
Recent advancements in function calling and tool use have significantly enhanced the capabilities of large language models (LLMs) by enabling them to interact with external information sources and execute complex tasks. However, the limited context window of LLMs presents challenges when a large number of tools are available, necessitating efficient methods to manage prompt length and maintain accuracy. Existing approaches, such as fine-tuning LLMs or leveraging their reasoning capabilities, either require frequent retraining or incur significant latency overhead. A more efficient solution involves training smaller models to retrieve the most relevant tools for a given query, although this requires high quality, domain-specific data. To address those challenges, we present a novel framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models. Empowered by LLMs, we create ToolBank, a new tool retrieval dataset that reflects real human user usages. For tool retrieval methodologies, we propose novel approaches: (1) Tool2Vec: usage-driven tool embedding generation for tool retrieval, (2) ToolRefiner: a staged retrieval method that iteratively improves the quality of retrieved tools, and (3) MLC: framing tool retrieval as a multi-label classification problem. With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank. Additionally, we present further experimental results to rigorously validate our methods. Our code is available at \url{https://github.com/SqueezeAILab/Tool2Vec}
♻ ☆ Dissociation of Faithful and Unfaithful Reasoning in LLMs
Large language models (LLMs) often improve their performance in downstream tasks when they generate Chain of Thought reasoning text before producing an answer. We investigate how LLMs recover from errors in Chain of Thought. Through analysis of error recovery behaviors, we find evidence for unfaithfulness in Chain of Thought, which occurs when models arrive at the correct answer despite invalid reasoning text. We identify factors that shift LLM recovery behavior: LLMs recover more frequently from obvious errors and in contexts that provide more evidence for the correct answer. Critically, these factors have divergent effects on faithful and unfaithful recoveries. Our results indicate that there are distinct mechanisms driving faithful and unfaithful error recoveries. Selective targeting of these mechanisms may be able to drive down the rate of unfaithful reasoning and improve model interpretability.
comment: code published at https://github.com/CoTErrorRecovery/CoTErrorRecovery
♻ ☆ Manipulating Large Language Models to Increase Product Visibility
Large language models (LLMs) are increasingly being integrated into search engines to provide natural language responses tailored to user queries. Customers and end-users are also becoming more dependent on these models for quick and easy purchase decisions. In this work, we investigate whether recommendations from LLMs can be manipulated to enhance a product's visibility. We demonstrate that adding a strategic text sequence (STS) -- a carefully crafted message -- to a product's information page can significantly increase its likelihood of being listed as the LLM's top recommendation. To understand the impact of STS, we use a catalog of fictitious coffee machines and analyze its effect on two target products: one that seldom appears in the LLM's recommendations and another that usually ranks second. We observe that the strategic text sequence significantly enhances the visibility of both products by increasing their chances of appearing as the top recommendation. This ability to manipulate LLM-generated search responses provides vendors with a considerable competitive advantage and has the potential to disrupt fair market competition. Just as search engine optimization (SEO) revolutionized how webpages are customized to rank higher in search engine results, influencing LLM recommendations could profoundly impact content optimization for AI-driven search services. Code for our experiments is available at https://github.com/aounon/llm-rank-optimizer.
♻ ☆ Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions
This paper examines the role of cognitive biases in the decision-making processes of large language models (LLMs), challenging the conventional goal of eliminating all biases. We show that certain cognitive biases when properly balanced, can enhance decision-making efficiency through rational deviations and heuristic shortcuts. By introducing heuristic moderation and an abstention option, which allows LLMs to withhold responses when uncertain, we reduce error rates, improve decision accuracy, and optimize decision rates. Using the Balance Rigor and Utility (BRU) dataset, developed through expert collaboration, our findings demonstrate that targeted inspection of cognitive biases aligns LLM decisions more closely with human reasoning, enhancing reliability and suggesting strategies for future improvements. This approach offers a novel way to leverage cognitive biases to improve the practical utility of LLMs across various applications.
comment: This article is currently under review. All data will be open on GitHub once the review is complete. https://github.com/limanwang/Balancing-Rigor-and-Utility
♻ ☆ Eliciting Informative Text Evaluations with Large Language Models
Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods only apply to rather simple reports, like multiple-choice or scalar numbers. We aim to broaden these techniques to the larger domain of text-based reports, drawing on the recent developments in large language models. This vastly increases the applicability of peer prediction mechanisms as textual feedback is the norm in a large variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media. We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms utilize LLMs as predictors, mapping from one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments conducted on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. We highlight the results that on the ICLR dataset, our mechanisms can differentiate three quality levels -- human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews in terms of expected scores. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM.
comment: Accepted by the Twenty-Fifth ACM Conference on Economics and Computation (EC'24)
♻ ☆ Exploring Bias and Prediction Metrics to Characterise the Fairness of Machine Learning for Equity-Centered Public Health Decision-Making: A Narrative Review
Background: The rapid advancement of Machine Learning (ML) represents novel opportunities to enhance public health research, surveillance, and decision-making. However, there is a lack of comprehensive understanding of algorithmic bias, systematic errors in predicted population health outcomes, resulting from the public health application of ML. The objective of this narrative review is to explore the types of bias generated by ML and quantitative metrics to assess these biases. Methods : We performed search on PubMed, MEDLINE, IEEE (Institute of Electrical and Electronics Engineers), ACM (Association for Computing Machinery) Digital Library, Science Direct, and Springer Nature. We used keywords to identify studies describing types of bias and metrics to measure these in the domain of ML and public and population health published in English between 2008 and 2023, inclusive. Results: A total of 72 articles met the inclusion criteria. Our review identified the commonly described types of bias and quantitative metrics to assess these biases from an equity perspective. Conclusion : The review will help formalize the evaluation framework for ML on public health from an equity perspective.
comment: under review
♻ ☆ Domain-Specific Improvement on Psychotherapy Chatbot Using Assistant ICASSP 2024
Large language models (LLMs) have demonstrated impressive generalization capabilities on specific tasks with human-written instruction data. However, the limited quantity, diversity, and professional expertise of such instruction data raise concerns about the performance of LLMs in psychotherapy tasks when provided with domain-specific instructions. To address this, we firstly propose Domain-Specific Assistant Instructions based on AlexanderStreet therapy, and secondly, we use an adaption fine-tuning method and retrieval augmented generation method to improve pre-trained LLMs. Through quantitative evaluation of linguistic quality using automatic and human evaluation, we observe that pre-trained LLMs on Psychotherapy Assistant Instructions outperform state-of-the-art LLMs response baselines. Our Assistant-Instruction approach offers a half-annotation method to align pre-trained LLMs with instructions and provide pre-trained LLMs with more psychotherapy knowledge.
comment: Accepted at ICASSP 2024 EIHRC Workshop
♻ ☆ Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks
Understanding cognitive processes in the brain demands sophisticated models capable of replicating neural dynamics at large scales. We present a physiologically inspired speech recognition architecture, compatible and scalable with deep learning frameworks, and demonstrate that end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network. Significant cross-frequency couplings, indicative of these oscillations, are measured within and across network layers during speech processing, whereas no such interactions are observed when handling background noise inputs. Furthermore, our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance. Overall, on top of developing our understanding of synchronisation phenomena notably observed in the human auditory pathway, our architecture exhibits dynamic and efficient information processing, with relevance to neuromorphic technology.
♻ ☆ Analyzing Diversity in Healthcare LLM Research: A Scientometric Perspective
The deployment of large language models (LLMs) in healthcare has demonstrated substantial potential for enhancing clinical decision-making, administrative efficiency, and patient outcomes. However, the underrepresentation of diverse groups in the development and application of these models can perpetuate biases, leading to inequitable healthcare delivery. This paper presents a comprehensive scientometric analysis of LLM research for healthcare, including data from January 1, 2021, to July 1, 2024. By analyzing metadata from PubMed and Dimensions, including author affiliations, countries, and funding sources, we assess the diversity of contributors to LLM research. Our findings highlight significant gender and geographic disparities, with a predominance of male authors and contributions primarily from high-income countries (HICs). We introduce a novel journal diversity index based on Gini diversity to measure the inclusiveness of scientific publications. Our results underscore the necessity for greater representation in order to ensure the equitable application of LLMs in healthcare. We propose actionable strategies to enhance diversity and inclusivity in artificial intelligence research, with the ultimate goal of fostering a more inclusive and equitable future in healthcare innovation.
♻ ☆ Sentiment Analysis Across Languages: Evaluation Before and After Machine Translation to English
People communicate in more than 7,000 languages around the world, with around 780 languages spoken in India alone. Despite this linguistic diversity, research on Sentiment Analysis has predominantly focused on English text data, resulting in a disproportionate availability of sentiment resources for English. This paper examines the performance of transformer models in Sentiment Analysis tasks across multilingual datasets and text that has undergone machine translation. By comparing the effectiveness of these models in different linguistic contexts, we gain insights into their performance variations and potential implications for sentiment analysis across diverse languages. We also discuss the shortcomings and potential for future work towards the end.
comment: 6 pages, 3 Figures
♻ ☆ ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction
In order to facilitate features such as faceted product search and product comparison, e-commerce platforms require accurately structured product data, including precise attribute/value pairs. Vendors often times provide unstructured product descriptions consisting only of an offer title and a textual description. Consequently, extracting attribute values from titles and descriptions is vital for e-commerce platforms. State-of-the-art attribute value extraction methods based on pre-trained language models, such as BERT, face two drawbacks (i) the methods require significant amounts of task-specific training data and (ii) the fine-tuned models have problems with generalising to unseen attribute values that were not part of the training data. This paper explores the potential of using large language models as a more training data-efficient and more robust alternative to existing AVE methods. We propose prompt templates for describing the target attributes of the extraction to the LLM, covering both zero-shot and few-shot scenarios. In the zero-shot scenario, textual and JSON-based target schema representations of the attributes are compared. In the few-shot scenario, we investigate (i) the provision of example attribute values, (ii) the selection of in-context demonstrations, (iii) shuffled ensembling to prevent position bias, and (iv) fine-tuning the LLM. We evaluate the prompt templates in combination with hosted LLMs, such as GPT-3.5 and GPT-4, and open-source LLMs which can be run locally. We compare the performance of the LLMs to the PLM-based methods SU-OpenTag, AVEQA, and MAVEQA. The highest average F1-score of 86% was achieved by GPT-4. Llama-3-70B performs only 3% worse than GPT-4, making it a competitive open-source alternative. Given the same training data, this prompt/GPT-4 combination outperforms the best PLM baseline by an average of 6% F1-score.
♻ ☆ AMERICANO: Argument Generation with Discourse-driven Decomposition and Agent Interaction
Argument generation is a challenging task in natural language processing, which requires rigorous reasoning and proper content organization. Inspired by recent chain-of-thought prompting that breaks down a complex task into intermediate steps, we propose Americano, a novel framework with agent interaction for argument generation. Our approach decomposes the generation process into sequential actions grounded on argumentation theory, which first executes actions sequentially to generate argumentative discourse components, and then produces a final argument conditioned on the components. To further mimic the human writing process and improve the left-to-right generation paradigm of current autoregressive language models, we introduce an argument refinement module which automatically evaluates and refines argument drafts based on feedback received. We evaluate our framework on the task of counterargument generation using a subset of Reddit/CMV dataset. The results show that our method outperforms both end-to-end and chain-of-thought prompting methods and can generate more coherent and persuasive arguments with diverse and rich contents.
comment: INLG 2024
♻ ☆ Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
The advent of large language models such as ChatGPT, Gemini, and others has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been comprehensively assessed. This study addresses this gap by introducing a novel multi-task spatial evaluation dataset, designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset encompasses twelve distinct task types, including spatial understanding and path planning, each with verified, accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4o, and ZhipuAI's glm-4, through a two-phase testing approach. Initially, we conducted zero-shot testing, followed by categorizing the dataset by difficulty and performing prompt tuning tests. Results indicate that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it surpassed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For example, the Chain-of-Thought (COT) strategy increased gpt-4o's accuracy in path planning from 12.4% to 87.5%, while a one-shot strategy enhanced moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
♻ ☆ LiveFC: A System for Live Fact-Checking of Audio Streams
The advances in the digital era have led to rapid dissemination of information. This has also aggravated the spread of misinformation and disinformation. This has potentially serious consequences, such as civil unrest. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable. While automated fact-checking approaches exist, they do not operate in real-time and do not always account for spread of misinformation through different modalities. This is particularly important as proactive fact-checking on live streams in real-time can help people be informed of false narratives and prevent catastrophic consequences that may cause civil unrest. This is particularly relevant with the rapid dissemination of information through video on social media platforms or other streams like political rallies and debates. Hence, in this work we develop a platform named LiveFC, that can aid in fact-checking live audio streams in real-time. LiveFC has a user-friendly interface that displays the claims detected along with their veracity and evidence for live streams with associated speakers for claims from respective segments. The app can be accessed at http://livefc.factiverse.ai and a screen recording of the demo can be found at https://bit.ly/3WVAoIw.
comment: Under Review, 11 pages
♻ ☆ Generalizing Fairness to Generative Language Models via Reformulation of Non-discrimination Criteria
Generative AI, such as large language models, has undergone rapid development within recent years. As these models become increasingly available to the public, concerns arise about perpetuating and amplifying harmful biases in applications. Gender stereotypes can be harmful and limiting for the individuals they target, whether they consist of misrepresentation or discrimination. Recognizing gender bias as a pervasive societal construct, this paper studies how to uncover and quantify the presence of gender biases in generative language models. In particular, we derive generative AI analogues of three well-known non-discrimination criteria from classification, namely independence, separation and sufficiency. To demonstrate these criteria in action, we design prompts for each of the criteria with a focus on occupational gender stereotype, specifically utilizing the medical test to introduce the ground truth in the generative AI context. Our results address the presence of occupational gender bias within such conversational language models.
♻ ☆ A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning KDD
Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ability. We refined the text chunks and tables in web pages, added attribute predictors to reduce hallucinations, conducted LLM Knowledge Extractor and Knowledge Graph Extractor, and finally built a reasoning strategy with all the references. We evaluated our system on the CRAG dataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and online evaluations demonstrate that our system significantly enhances complex reasoning capabilities. In local evaluations, we have significantly improved accuracy and reduced error rates compared to the baseline model, achieving a notable increase in scores. In the meanwhile, we have attained outstanding results in online assessments, demonstrating the performance and generalization capabilities of the proposed system. The source code for our system is released in \url{https://gitlab.aicrowd.com/shizueyy/crag-new}.
comment: Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024
♻ ☆ ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReST$^\text{EM}$ and Self-Rewarding LM.
comment: 30 pages
♻ ☆ FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
Large language models (LLMs) excel in generating coherent text, but they often struggle with context awareness, leading to inaccuracies in tasks requiring faithful adherence to provided information. We introduce FastMem, a novel method designed to enhance instruction fine-tuned LLMs' context awareness through fast memorization of the prompt. FastMem maximizes the likelihood of the prompt before inference by fine-tuning only the last Feed-Forward Network (FFN) module. This targeted approach ensures efficient optimization without overfitting, significantly improving the model's ability to comprehend and accurately follow the context. Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures. For instance, FastMem improves the accuracy of Llama 3-8B-Inst on the NQ-SWAP dataset from 59.1% to 71.6%, and reduces the output structure failure rate of Qwen 1.5-4B-Chat from 34.9% to 25.5%. Extensive experimental results highlight FastMem's potential to offer a robust solution to enhance the reliability and accuracy of LLMs in various applications. Our code is available at: https://github.com/IAAR-Shanghai/FastMem
♻ ☆ A Formal Perspective on Byte-Pair Encoding ACL 2023
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{{\sigma(\boldsymbol{\mu}^\star)}}(1-e^{-{\sigma(\boldsymbol{\mu}^\star)}})$-approximation of an optimal merge sequence, where ${\sigma(\boldsymbol{\mu}^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
comment: ACL 2023
♻ ☆ Pitfalls and Outlooks in Using COMET
Since its introduction, the COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, it being a machine learning model also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores is not comparable between papers or even technical setups and we put forward our perspective on fixing each issue. Furthermore, we release the SacreCOMET package that can generate a signature for the software and model configuration as well as an appropriate citation. The goal of this work is to help the community make more sound use of the COMET metric.
♻ ☆ AudioBench: A Universal Benchmark for Audio Large Language Models
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among which, 7 are newly proposed datasets. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic). Despite recent advancements, there lacks a comprehensive benchmark for AudioLLMs on instruction following capabilities conditioned on audio signals. AudioBench addresses this gap by setting up datasets as well as desired evaluation metrics. Besides, we also evaluated the capabilities of five popular models and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model developments.
comment: v3 - Abundent update on models and evaluation details; Code: https://github.com/AudioLLMs/AudioBench
♻ ☆ Contrasting Linguistic Patterns in Human and LLM-Generated News Text
We conduct a quantitative analysis contrasting human-written English news text with comparable large language model (LLM) output from six different LLMs that cover three different families and four sizes in total. Our analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects. The results reveal various measurable differences between human and AI-generated texts. Human texts exhibit more scattered sentence length distributions, more variety of vocabulary, a distinct use of dependency and constituent types, shorter constituents, and more optimized dependency distances. Humans tend to exhibit stronger negative emotions (such as fear and disgust) and less joy compared to text generated by LLMs, with the toxicity of these models increasing as their size grows. LLM outputs use more numbers, symbols and auxiliaries (suggesting objective language) than human texts, as well as more pronouns. The sexist bias prevalent in human text is also expressed by LLMs, and even magnified in all of them but one. Differences between LLMs and humans are larger than between LLMs.
comment: Published at Artificial Intelligence Review vol. 57, 265
♻ ☆ LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction
In the context of English as a Foreign Language (EFL) writing education, LLM-as-a-tutor can assist students by providing real-time feedback on their essays. However, challenges arise in assessing LLM-as-a-tutor due to differing standards between educational and general use cases. To bridge this gap, we integrate pedagogical principles to assess student-LLM interaction. First, we explore how LLMs can function as English tutors, providing effective essay feedback tailored to students. Second, we propose three metrics to evaluate LLM-as-a-tutor specifically designed for EFL writing education, emphasizing pedagogical aspects. In this process, EFL experts evaluate the feedback from LLM-as-a-tutor regarding quality and characteristics. On the other hand, EFL learners assess their learning outcomes from interaction with LLM-as-a-tutor. This approach lays the groundwork for developing LLMs-as-a-tutor tailored to the needs of EFL learners, advancing the effectiveness of writing education in this context.
♻ ☆ MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents
Machine learning research, crucial for technological advancements and innovation, often faces significant challenges due to its inherent complexity, slow pace of experimentation, and the necessity for specialized expertise. Motivated by this, we present a new systematic framework, autonomous Machine Learning Research with large language models (MLR-Copilot), designed to enhance machine learning research productivity through the automatic generation and implementation of research ideas using Large Language Model (LLM) agents. The framework consists of three phases: research idea generation, experiment implementation, and implementation execution. First, existing research papers are used to generate hypotheses and experimental plans vis IdeaAgent powered by LLMs. Next, the implementation generation phase translates these plans into executables with ExperimentAgent. This phase leverages retrieved prototype code and optionally retrieves candidate models and data. Finally, the execution phase, also managed by ExperimentAgent, involves running experiments with mechanisms for human feedback and iterative debugging to enhance the likelihood of achieving executable research outcomes. We evaluate our framework on five machine learning research tasks and the experimental results show the framework's potential to facilitate the research progress and innovations.
♻ ☆ Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation ICPR 2024
In this work, we study the task of ``visually'' translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the source scene text, such as font, size, and background. There are several challenges associated with this task, such as translation with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements to design a variant of the baseline to obtain performance improvements. (iv) Currently, the existing related literature lacks any comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate presented baselines for translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baseline only partially addresses challenges posed by visual translation tasks. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.
comment: Accepted at ICPR 2024, Project Website: https://vl2g.github.io/projects/visTrans/
♻ ☆ An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models ECCV 2024
In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45 reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly customizable and pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical values for deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
comment: Accepted to ECCV 2024 (Oral), code is released at https://github.com/pkunlp-icler/FastV,
♻ ☆ CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
With the profound development of large language models(LLMs), their safety concerns have garnered increasing attention. However, there is a scarcity of Chinese safety benchmarks for LLMs, and the existing safety taxonomies are inadequate, lacking comprehensive safety detection capabilities in authentic Chinese scenarios. In this work, we introduce CHiSafetyBench, a dedicated safety benchmark for evaluating LLMs' capabilities in identifying risky content and refusing answering risky questions in Chinese contexts. CHiSafetyBench incorporates a dataset that covers a hierarchical Chinese safety taxonomy consisting of 5 risk areas and 31 categories. This dataset comprises two types of tasks: multiple-choice questions and question-answering, evaluating LLMs from the perspectives of risk content identification and the ability to refuse answering risky questions respectively. Utilizing this benchmark, we validate the feasibility of automatic evaluation as a substitute for human evaluation and conduct comprehensive automatic safety assessments on mainstream Chinese LLMs. Our experiments reveal the varying performance of different models across various safety domains, indicating that all models possess considerable potential for improvement in Chinese safety capabilities. Our dataset is publicly available at https://github.com/UnicomAI/UnicomBenchmark/tree/main/CHiSafetyBench.
comment: 16 pages, 5 figures
♻ ☆ ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation
Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models outputted labels that maintained or increased the inter-annotator agreement, suggesting that they are within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement. Data available here: https://github.com/a-brassard/ACORN
comment: 18 pages, 7 figures, accepted to COLM 2024. Data available here: https://github.com/a-brassard/ACORN
♻ ☆ MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms ACL 2024
Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, making it challenging for machines to comprehend the information or emotions associated with interactions in online spaces. Multimodal Large Language Models (MLLMs) have emerged as a promising solution to these challenges, yet they struggle to accurately interpret human emotions and complex content such as misinformation. This paper introduces MM-Soc, a comprehensive benchmark designed to evaluate MLLMs' understanding of multimodal social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset, targeting a range of tasks from misinformation detection, hate speech detection, and social context generation. Through our exhaustive evaluation on ten size-variants of four open-source MLLMs, we have identified significant performance disparities, highlighting the need for advancements in models' social understanding capabilities. Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks. However, MLLMs demonstrate performance improvements post fine-tuning, suggesting potential pathways for improvement. Our code and data are available at https://github.com/claws-lab/MMSoc.git.
comment: In Proceedings of ACL 2024
♻ ☆ Persuasion Games using Large Language Models
Large Language Models (LLMs) have emerged as formidable instruments capable of comprehending and producing human-like text. This paper explores the potential of LLMs, to shape user perspectives and subsequently influence their decisions on particular tasks. This capability finds applications in diverse domains such as Investment, Credit cards and Insurance, wherein they assist users in selecting appropriate insurance policies, investment plans, Credit cards, Retail, as well as in Behavioral Change Support Systems (BCSS). We present a sophisticated multi-agent framework wherein a consortium of agents operate in collaborative manner. The primary agent engages directly with user agents through persuasive dialogue, while the auxiliary agents perform tasks such as information retrieval, response analysis, development of persuasion strategies, and validation of facts. Empirical evidence from our experiments demonstrates that this collaborative methodology significantly enhances the persuasive efficacy of the LLM. We continuously analyze the resistance of the user agent to persuasive efforts and counteract it by employing a combination of rule-based and LLM-based resistance-persuasion mapping techniques. We employ simulated personas and generate conversations in insurance, banking, and retail domains to evaluate the proficiency of large language models (LLMs) in recognizing, adjusting to, and influencing various personality types. Concurrently, we examine the resistance mechanisms employed by LLM simulated personas. Persuasion is quantified via measurable surveys before and after interaction, LLM-generated scores on conversation, and user decisions (purchase or non-purchase).
♻ ☆ Cultural Compass: Predicting Transfer Learning Success in Offensive Language Detection with Cultural Features EMNLP 2023
The increasing ubiquity of language technology necessitates a shift towards considering cultural diversity in the machine learning realm, particularly for subjective tasks that rely heavily on cultural nuances, such as Offensive Language Detection (OLD). Current understanding underscores that these tasks are substantially influenced by cultural values, however, a notable gap exists in determining if cultural features can accurately predict the success of cross-cultural transfer learning for such subjective tasks. Addressing this, our study delves into the intersection of cultural features and transfer learning effectiveness. The findings reveal that cultural value surveys indeed possess a predictive power for cross-cultural transfer learning success in OLD tasks and that it can be further improved using offensive word distance. Based on these results, we advocate for the integration of cultural information into datasets. Additionally, we recommend leveraging data sources rich in cultural information, such as surveys, to enhance cultural adaptability. Our research signifies a step forward in the quest for more inclusive, culturally sensitive language technologies.
comment: Findings of EMNLP 2023 (update)
♻ ☆ From Wide to Deep: Dimension Lifting Network for Parameter-efficient Knowledge Graph Embedding
Knowledge graph embedding (KGE) that maps entities and relations into vector representations is essential for downstream applications. Conventional KGE methods require high-dimensional representations to learn the complex structure of knowledge graph, but lead to oversized model parameters. Recent advances reduce parameters by low-dimensional entity representations, while developing techniques (e.g., knowledge distillation or reinvented representation forms) to compensate for reduced dimension. However, such operations introduce complicated computations and model designs that may not benefit large knowledge graphs. To seek a simple strategy to improve the parameter efficiency of conventional KGE models, we take inspiration from that deeper neural networks require exponentially fewer parameters to achieve expressiveness comparable to wider networks for compositional structures. We view all entity representations as a single-layer embedding network, and conventional KGE methods that adopt high-dimensional entity representations equal widening the embedding network to gain expressiveness. To achieve parameter efficiency, we instead propose a deeper embedding network for entity representations, i.e., a narrow entity embedding layer plus a multi-layer dimension lifting network (LiftNet). Experiments on three public datasets show that by integrating LiftNet, four conventional KGE methods with 16-dimensional representations achieve comparable link prediction accuracy as original models that adopt 512-dimensional representations, saving 68.4% to 96.9% parameters.
Computer Vision and Pattern Recognition
♻ ☆ Inter-Frame Compression for Dynamic Point Cloud Geometry Coding
Efficient point cloud compression is essential for applications like virtual and mixed reality, autonomous driving, and cultural heritage. This paper proposes a deep learning-based inter-frame encoding scheme for dynamic point cloud geometry compression. We propose a lossy geometry compression scheme that predicts the latent representation of the current frame using the previous frame by employing a novel feature space inter-prediction network. The proposed network utilizes sparse convolutions with hierarchical multiscale 3D feature learning to encode the current frame using the previous frame. The proposed method introduces a novel predictor network for motion compensation in the feature domain to map the latent representation of the previous frame to the coordinates of the current frame to predict the current frame's feature embedding. The framework transmits the residual of the predicted features and the actual features by compressing them using a learned probabilistic factorized entropy model. At the receiver, the decoder hierarchically reconstructs the current frame by progressively rescaling the feature embedding. The proposed framework is compared to the state-of-the-art Video-based Point Cloud Compression (V-PCC) and Geometry-based Point Cloud Compression (G-PCC) schemes standardized by the Moving Picture Experts Group (MPEG). The proposed method achieves more than 88% BD-Rate (Bjontegaard Delta Rate) reduction against G-PCCv20 Octree, more than 56% BD-Rate savings against G-PCCv20 Trisoup, more than 62% BD-Rate reduction against V-PCC intra-frame encoding mode, and more than 52% BD-Rate savings against V-PCC P-frame-based inter-frame encoding mode using HEVC. These significant performance gains are cross-checked and verified in the MPEG working group.
♻ ☆ SEDMamba: Enhancing Selective State Space Modelling with Bottleneck Mechanism and Fine-to-Coarse Temporal Fusion for Efficient Error Detection in Robot-Assisted Surgery
Automated detection of surgical errors can improve robotic-assisted surgery. Despite promising progress, existing methods still face challenges in capturing rich temporal context to establish long-term dependencies while maintaining computational efficiency. In this paper, we propose a novel hierarchical model named SEDMamba, which incorporates the selective state space model (SSM) into surgical error detection, facilitating efficient long sequence modelling with linear complexity. SEDMamba enhances selective SSM with a bottleneck mechanism and fine-to-coarse temporal fusion (FCTF) to detect and temporally localize surgical errors in long videos. The bottleneck mechanism compresses and restores features within their spatial dimension, thereby reducing computational complexity. FCTF utilizes multiple dilated 1D convolutional layers to merge temporal information across diverse scale ranges, accommodating errors of varying duration. Our work also contributes the first-of-its-kind, frame-level, in-vivo surgical error dataset to support error detection in real surgical cases. Specifically, we deploy the clinically validated observational clinical human reliability assessment tool (OCHRA) to annotate the errors during suturing tasks in an open-source radical prostatectomy dataset (SAR-RARP50). Experimental results demonstrate that our SEDMamba outperforms state-of-the-art methods with at least 1.82% AUC and 3.80% AP performance gains with significantly reduced computational complexity. The corresponding error annotations, code and models will be released at https://github.com/wzjialang/SEDMamba.
comment: 8 pages
♻ ☆ Efficient Video Object Segmentation via Modulated Cross-Attention Memory WACV 2025
Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.
comment: WACV 2025
♻ ☆ RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance
Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constraint devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate the solution of the coherence problem through the proposed approach by reporting substantive experiments to demonstrate our approach's effectiveness in compact model size and excellent generation quality.
♻ ☆ Multi-Visual-Inertial System: Analysis, Calibration and Estimation
In this paper, we study state estimation of multi-visual-inertial systems (MVIS) and develop sensor fusion algorithms to optimally fuse an arbitrary number of asynchronous inertial measurement units (IMUs) or gyroscopes and global and(or) rolling shutter cameras. We are especially interested in the full calibration of the associated visual-inertial sensors, including the IMU or camera intrinsics and the IMU-IMU(or camera) spatiotemporal extrinsics as well as the image readout time of rolling-shutter cameras (if used). To this end, we develop a new analytic combined IMU integration with intrinsics-termed ACI3-to preintegrate IMU measurements, which is leveraged to fuse auxiliary IMUs and(or) gyroscopes alongside a base IMU. We model the multi-inertial measurements to include all the necessary inertial intrinsic and IMU-IMU spatiotemporal extrinsic parameters, while leveraging IMU-IMU rigid-body constraints to eliminate the necessity of auxiliary inertial poses and thus reducing computational complexity. By performing observability analysis of MVIS, we prove that the standard four unobservable directions remain - no matter how many inertial sensors are used, and also identify, for the first time, degenerate motions for IMU-IMU spatiotemporal extrinsics and auxiliary inertial intrinsics. In addition to the extensive simulations that validate our analysis and algorithms, we have built our own MVIS sensor rig and collected over 25 real-world datasets to experimentally verify the proposed calibration against the state-of-the-art calibration method such as Kalibr. We show that the proposed MVIS calibration is able to achieve competing accuracy with improved convergence and repeatability, which is open sourced to better benefit the community.
♻ ☆ On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models
Volumetric medical segmentation models have achieved significant success on organ and tumor-based segmentation tasks in recent years. However, their vulnerability to adversarial attacks remains largely unexplored, raising serious concerns regarding the real-world deployment of tools employing such models in the healthcare sector. This underscores the importance of investigating the robustness of existing models. In this context, our work aims to empirically examine the adversarial robustness across current volumetric segmentation architectures, encompassing Convolutional, Transformer, and Mamba-based models. We extend this investigation across four volumetric segmentation datasets, evaluating robustness under both white box and black box adversarial attacks. Overall, we observe that while both pixel and frequency-based attacks perform reasonably well under \emph{white box} setting, the latter performs significantly better under transfer-based black box attacks. Across our experiments, we observe transformer-based models show higher robustness than convolution-based models with Mamba-based models being the most vulnerable. Additionally, we show that large-scale training of volumetric segmentation models improves the model's robustness against adversarial attacks. The code and robust models are available at https://github.com/HashmatShadab/Robustness-of-Volumetric-Medical-Segmentation-Models.
comment: Accepted at British Machine Vision Conference 2024
♻ ☆ Training-free Long Video Generation with Chain of Diffusion Model Experts
Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models need high computational costs and produce suboptimal results due to high complexity of video generation task. In this paper, we propose \textbf{ConFiner}, an efficient high-quality video generation framework that decouples video generation into easier subtasks: structure \textbf{con}trol and spatial-temporal re\textbf{fine}ment. It can generate high-quality videos with chain of off-the-shelf diffusion model experts, each expert responsible for a decoupled subtask. During the refinement, we introduce coordinated denoising, which can merge multiple diffusion experts' capabilities into a single sampling. Furthermore, we design ConFiner-Long framework, which can generate long coherent video with three constraint strategies on ConFiner. Experimental results indicate that with only 10\% of the inference cost, our ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics. And ConFiner-Long can generate high-quality and coherent videos with up to 600 frames.
♻ ☆ TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos
We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by a large margin from prior work. https://yufu-wang.github.io/tram4d/
comment: The project website: https://yufu-wang.github.io/tram4d/
♻ ☆ Privacy-Aware Document Visual Question Answering ICDAR 2024
Document Visual Question Answering (DocVQA) has quickly grown into a central task of document understanding. But despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees. In this work, we explore privacy in the domain of DocVQA for the first time, highlighting privacy issues in state of the art multi-modal LLM models used for DocVQA, and explore possible solutions. Specifically, we focus on invoice processing as a realistic document understanding scenario, and propose a large scale DocVQA dataset comprising invoice documents and associated questions and answers. We employ a federated learning scheme, that reflects the real-life distribution of documents in different businesses, and we explore the use case where the data of the invoice provider is the sensitive information to be protected. We demonstrate that non-private models tend to memorise, a behaviour that can lead to exposing private information. We then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through either or both of the two input modalities: vision (document image) or language (OCR tokens). Finally, we design attacks exploiting the memorisation effect of the model, and demonstrate their effectiveness in probing a representative DocVQA models.
comment: 35 pages, 12 figures, accepted for publication at the 18th International Conference on Document Analysis and Recognition, ICDAR 2024
♻ ☆ Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? ECCV 2024
Foundation models have emerged as robust models with label efficiency in diverse domains. In medical imaging, these models contribute to the advancement of medical diagnoses due to the difficulty in obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines the bias in the Foundation model (RetFound) when it is applied to fine-tune the Brazilian Multilabel Ophthalmological Dataset (BRSET), which has a different population than the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the Foundation Model has the potential to reduce the gap between the maximum AUC and minimum AUC evaluations across gender and age groups. However, in a data-efficient generalization, the model increases the bias when the data amount decreases. These findings suggest that when deploying a Foundation Model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
comment: Preprint of paper to be presented at Fairness and Ethics Towards Transparent AI: Facing the Challenge through Model Debiasing (FAILED) during ECCV 2024
♻ ☆ LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation
Conventional medical image segmentation methods have been found inadequate in facilitating physicians with the identification of specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions. Due to the varying object scales in medical images, MIRS demands robust vision-language modeling and comprehensive multi-scale interaction for precise localization and segmentation under linguistic guidance. However, existing medical image segmentation methods fall short in meeting these demands, resulting in insufficient segmentation accuracy. In response, we propose an approach named Language-guided Scale-aware MedSegmentor (LSMS), incorporating two appealing designs: (1)~a Scale-aware Vision-Language Attention module that leverages diverse convolutional kernels to acquire rich visual knowledge and interact closely with linguistic features, thereby enhancing lesion localization capability; (2)~a Full-Scale Decoder that globally models multi-modal features across various scales, capturing complementary information between scales to accurately outline lesion boundaries. Addressing the lack of suitable datasets for MIRS, we constructed a vision-language medical dataset called Reference Hepatic Lesion Segmentation (RefHL-Seg). This dataset comprises 2,283 abdominal CT slices from 231 cases, with corresponding textual annotations and segmentation masks for various liver lesions in images. We validated the performance of LSMS for MIRS and conventional medical image segmentation tasks across various datasets. Our LSMS consistently outperforms on all datasets with lower computational costs. The code and datasets will be released.
comment: 14 pages, 5 figures
♻ ☆ Implicit Concept Removal of Diffusion Models ECCV2024
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images. These concepts, termed as the "implicit concepts", could be unintentionally learned during training and then be generated uncontrollably during inference. Existing removal methods still struggle to eliminate implicit concepts primarily due to their dependency on the model's ability to recognize concepts it actually can not discern. To address this, we utilize the intrinsic geometric characteristics of implicit concepts and present the Geom-Erasing, a novel concept removal method based on the geometric-driven control. Specifically, once an unwanted implicit concept is identified, we integrate the existence and geometric information of the concept into the text prompts with the help of an accessible classifier or detector model. Subsequently, the model is optimized to identify and disentangle this information, which is then adopted as negative prompts during generation. Moreover, we introduce the Implicit Concept Dataset (ICD), a novel image-text dataset imbued with three typical implicit concepts (i.e., QR codes, watermarks, and text), reflecting real-life situations where implicit concepts are easily injected. Geom-Erasing effectively mitigates the generation of implicit concepts, achieving the state-of-the-art results on the Inappropriate Image Prompts (I2P) and our challenging Implicit Concept Dataset (ICD) benchmarks.
comment: Accepted by ECCV2024. Project Page: https://kaichen1998.github.io/projects/geom-erasing/
♻ ☆ GUing: A Mobile GUI Search Engine using a Vision-Language Model
App developers use the Graphical User Interface (GUI) of other apps as a source of inspiration for designing and improving their own apps. Recent research has thus suggested retrieving relevant GUI designs that match a certain text query from screenshot datasets acquired through crowdsourced or automated exploration of GUIs. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements, neglecting visual information such as icons or background images. In addition, retrieved screenshots are not steered by app developers and often lack important app features that require particular input data. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip, which we trained specifically for the problem of designing app GUIs. For this, we first collected from Google Play app introduction images which usually display the most representative screenshots and are often captioned (i.e.~labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This resulted in a large dataset which we share with this paper: including 303k app screenshots, out of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind in GUI retrieval. We evaluated our approach on various datasets from related work and in manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of GUIClip for other GUI tasks including GUI classification and sketch-to-GUI retrieval with encouraging results.
♻ ☆ An open dataset for oracle bone script recognition and decipherment
Oracle bone script, one of the earliest known forms of ancient Chinese writing, presents invaluable research materials for scholars studying the humanities and geography of the Shang Dynasty, dating back 3,000 years. The immense historical and cultural significance of these writings cannot be overstated. However, the passage of time has obscured much of their meaning, presenting a significant challenge in deciphering these ancient texts. With the advent of Artificial Intelligence (AI), employing AI to assist in deciphering Oracle Bone Characters (OBCs) has become a feasible option. Yet, progress in this area has been hindered by a lack of high-quality datasets. To address this issue, this paper details the creation of the HUST-OBC dataset. This dataset encompasses 77,064 images of 1,588 individual deciphered characters and 62,989 images of 9,411 undeciphered characters, with a total of 140,053 images, compiled from diverse sources. The hope is that this dataset could inspire and assist future research in deciphering those unknown OBCs. All the codes and datasets are available at https://github.com/Yuliang-Liu/Open-Oracle.
♻ ☆ SABER-6D: Shape Representation Based Implicit Object Pose Estimation ECCV 2024
In this paper, we propose a novel encoder-decoder architecture, named SABER, to learn the 6D pose of the object in the embedding space by learning shape representation at a given pose. This model enables us to learn pose by performing shape representation at a target pose from RGB image input. We perform shape representation as an auxiliary task which helps us in learning rotations space for an object based on 2D images. An image encoder predicts the rotation in the embedding space and the DeepSDF based decoder learns to represent the object's shape at the given pose. As our approach is shape based, the pipeline is suitable for any type of object irrespective of the symmetry. Moreover, we need only a CAD model of the objects to train SABER. Our pipeline is synthetic data based and can also handle symmetric objects without symmetry labels and, thus, no additional labeled training data is needed. The experimental evaluation shows that our method achieves close to benchmark results for both symmetric objects and asymmetric objects on Occlusion-LineMOD, and T-LESS datasets.
comment: ECCV 2024 R6D workshop
♻ ☆ A Deep-Learning-Based Label-free No-Reference Image Quality Assessment Metric: Application in Sodium MRI Denoising
New multinuclear MRI techniques, such as sodium MRI, generally suffer from low image quality due to an inherently low signal. Postprocessing methods, such as image denoising, have been developed for image enhancement. However, the assessment of these enhanced images is challenging especially considering when there is a lack of high resolution and high signal images as reference, such as in sodium MRI. No-reference Image Quality Assessment (NR-IQA) metrics are approaches to solve this problem. Existing learning-based NR-IQA metrics rely on labels derived from subjective human opinions or metrics like Signal-to-Noise Ratio (SNR), which are either time-consuming or lack accurate ground truths, resulting in unreliable assessment. We note that deep learning (DL) models have a unique characteristic in that they are specialized to a characteristic training set, meaning that deviations between the input testing data from the training data will reduce prediction accuracy. Therefore, we propose a novel DL-based NR-IQA metric, the Model Specialization Metric (MSM), which does not depend on ground-truth images or labels. MSM measures the difference between the input image and the model's prediction for evaluating the quality of the input image. Experiments conducted on both simulated distorted proton T1-weighted MR images and denoised sodium MR images demonstrate that MSM exhibits a superior evaluation performance on various simulated noises and distortions. MSM also has a substantial agreement with the expert evaluations, achieving an averaged Cohen's Kappa coefficient of 0.6528, outperforming the existing NR-IQA metrics.
comment: 13 pages, 3 figures
♻ ☆ Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model ACM MM 2024
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. Our code and weights are available at https://github.com/xhanxu/Mamba3D.
comment: ACM MM 2024. Code and weights are available at https://github.com/xhanxu/Mamba3D
♻ ☆ Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis BMVC2024
Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper we address, the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity compared to the state-of-the-art.
comment: Accepted at BMVC2024
♻ ☆ Structured Generative Models for Scene Understanding
This position paper argues for the use of \emph{structured generative models} (SGMs) for the understanding of static scenes. This requires the reconstruction of a 3D scene from an input image (or a set of multi-view images), whereby the contents of the image(s) are causally explained in terms of models of instantiated objects, each with their own type, shape, appearance and pose, along with global variables like scene lighting and camera parameters. This approach also requires scene models which account for the co-occurrences and inter-relationships of objects in a scene. The SGM approach has the merits that it is compositional and generative, which lead to interpretability and editability. \\\\ To pursue the SGM agenda, we need models for objects and scenes, and approaches to carry out inference. We first review models for objects, which include ``things'' (object categories that have a well defined shape), and ``stuff'' (categories which have amorphous spatial extent). We then move on to review \emph{scene models} which describe the inter-relationships of objects. Perhaps the most challenging problem for SGMs is \emph{inference} of the objects, lighting and camera parameters, and scene inter-relationships from input consisting of a single or multiple images. We conclude with a discussion of issues that need addressing to advance the SGM agenda.
comment: 32 pages, 10 figures
♻ ☆ MAPL: Memory Augmentation and Pseudo-Labeling for Semi-Supervised Anomaly Detection
Large unlabeled data and difficult-to-identify anomalies are the urgent issues need to overcome in most industrial scene. In order to address this issue, a new meth-odology for detecting surface defects in in-dustrial settings is introduced, referred to as Memory Augmentation and Pseudo-Labeling(MAPL). The methodology first in-troduces an anomaly simulation strategy, which significantly improves the model's ability to recognize rare or unknown anom-aly types by generating simulated anomaly samples. To cope with the problem of the lack of labeling of anomalous simulated samples, a pseudo-labeler method based on a one-classifier ensemble was employed in this study, which enhances the robustness of the model in the case of limited labeling data by automatically selecting key pseudo-labeling hyperparameters. Meanwhile, a memory-enhanced learning mechanism is introduced to effectively predict abnormal regions by analyzing the difference be-tween the input samples and the normal samples in the memory pool. An end-to-end learning framework is employed by MAPL to identify the abnormal regions directly from the input data, which optimizes the ef-ficiency and real-time performance of de-tection. By conducting extensive trials on the recently developed BHAD dataset (in-cluding MVTec AD [1], Visa [2], and MDPP [3]), MAPL achieves an average im-age-level AUROC score of 86.2%, demon-strating a 5.1% enhancement compared to the original MemSeg [4] model. The source code is available at https://github.com/jzc777/MAPL.
♻ ☆ MCDubber: Multimodal Context-Aware Expressive Video Dubbing SC2024
Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed \textbf{MCDubber}, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
comment: Accepted by NCMMSC2024
♻ ☆ UniUSNet: A Promptable Framework for Universal Ultrasound Disease Prediction and Tissue Segmentation
Ultrasound is widely used in clinical practice due to its affordability, portability, and safety. However, current AI research often overlooks combined disease prediction and tissue segmentation. We propose UniUSNet, a universal framework for ultrasound image classification and segmentation. This model handles various ultrasound types, anatomical positions, and input formats, excelling in both segmentation and classification tasks. Trained on a comprehensive dataset with over 9.7K annotations from 7 distinct anatomical positions, our model matches state-of-the-art performance and surpasses single-dataset and ablated models. Zero-shot and fine-tuning experiments show strong generalization and adaptability with minimal fine-tuning. We plan to expand our dataset and refine the prompting mechanism, with model weights and code available at (https://github.com/Zehui-Lin/UniUSNet).
comment: Accepted to BIBM 2024
♻ ☆ GuidedNet: Semi-Supervised Multi-Organ Segmentation via Labeled Data Guide Unlabeled Data ACM MM2024
Semi-supervised multi-organ medical image segmentation aids physicians in improving disease diagnosis and treatment planning and reduces the time and effort required for organ annotation.Existing state-of-the-art methods train the labeled data with ground truths and train the unlabeled data with pseudo-labels. However, the two training flows are separate, which does not reflect the interrelationship between labeled and unlabeled data.To address this issue, we propose a semi-supervised multi-organ segmentation method called GuidedNet, which leverages the knowledge from labeled data to guide the training of unlabeled data. The primary goals of this study are to improve the quality of pseudo-labels for unlabeled data and to enhance the network's learning capability for both small and complex organs.A key concept is that voxel features from labeled and unlabeled data that are close to each other in the feature space are more likely to belong to the same class.On this basis, a 3D Consistent Gaussian Mixture Model (3D-CGMM) is designed to leverage the feature distributions from labeled data to rectify the generated pseudo-labels.Furthermore, we introduce a Knowledge Transfer Cross Pseudo Supervision (KT-CPS) strategy, which leverages the prior knowledge obtained from the labeled data to guide the training of the unlabeled data, thereby improving the segmentation accuracy for both small and complex organs. Extensive experiments on two public datasets, FLARE22 and AMOS, demonstrated that GuidedNet is capable of achieving state-of-the-art performance. The source code with our proposed model are available at https://github.com/kimjisoo12/GuidedNet.
comment: Accepted by ACM MM2024, 10 pages, 5 figures
♻ ☆ PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding
3D visual grounding aims to identify objects in 3D point cloud scenes that match specific natural language descriptions. This requires the model to not only focus on the target object itself but also to consider the surrounding environment to determine whether the descriptions are met. Most previous works attempt to accomplish both tasks within the same module, which can easily lead to a distraction of attention. To this end, we propose PD-APE, a dual-branch decoding framework that separately decodes target object attributes and surrounding layouts. Specifically, in the target object branch, the decoder processes text tokens that describe features of the target object (e.g., category and color), guiding the queries to pay attention to the target object itself. In the surrounding branch, the queries align with other text tokens that carry surrounding environment information, making the attention maps accurately capture the layout described in the text. Benefiting from the proposed dual-branch design, the queries are allowed to focus on points relevant to each branch's specific objective. Moreover, we design an adaptive position encoding method for each branch respectively. In the target object branch, the position encoding relies on the relative positions between seed points and predicted 3D boxes. In the surrounding branch, the attention map is additionally guided by the confidence between visual and text features, enabling the queries to focus on points that have valuable layout information. Extensive experiments demonstrate that we surpass the state-of-the-art on two widely adopted 3D visual grounding datasets, ScanRefer and Nr3D.
♻ ☆ DORec: Decomposed Object Reconstruction and Segmentation Utilizing 2D Self-Supervised Features
Recovering 3D geometry and textures of individual objects is crucial for many robotics applications, such as manipulation, pose estimation, and autonomous driving. However, decomposing a target object from a complex background is challenging. Most existing approaches rely on costly manual labels to acquire object instance perception. Recent advancements in 2D self-supervised learning offer new prospects for identifying objects of interest, yet leveraging such noisy 2D features for clean decomposition remains difficult. In this paper, we propose a Decomposed Object Reconstruction (DORec) network based on neural implicit representations. Our key idea is to use 2D self-supervised features to create two levels of masks for supervision: a binary mask for foreground regions and a K-cluster mask for semantically similar regions. These complementary masks result in robust decomposition. Experimental results on different datasets show DORec's superiority in segmenting and reconstructing diverse foreground objects from varied backgrounds enabling downstream tasks such as pose estimation.
♻ ☆ FRDiff : Feature Reuse for Universal Training-free Acceleration of Diffusion Models ECCV 2024
The substantial computational costs of diffusion models, especially due to the repeated denoising steps necessary for high-quality image generation, present a major obstacle to their widespread adoption. While several studies have attempted to address this issue by reducing the number of score function evaluations (NFE) using advanced ODE solvers without fine-tuning, the decreased number of denoising iterations misses the opportunity to update fine details, resulting in noticeable quality degradation. In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. Reusing feature maps with high temporal similarity opens up a new opportunity to save computation resources without compromising output quality. To realize the practical benefits of this intuition, we conduct an extensive analysis and propose a novel method, FRDiff. FRDiff is designed to harness the advantages of both reduced NFE and feature reuse, achieving a Pareto frontier that balances fidelity and latency trade-offs in various generative tasks.
comment: Accepted at ECCV 2024. Code : https://github.com/ECoLab-POSTECH/FRDiff
♻ ☆ Evolution-aware VAriance (EVA) Coreset Selection for Medical Image Classification
In the medical field, managing high-dimensional massive medical imaging data and performing reliable medical analysis from it is a critical challenge, especially in resource-limited environments such as remote medical facilities and mobile devices. This necessitates effective dataset compression techniques to reduce storage, transmission, and computational cost. However, existing coreset selection methods are primarily designed for natural image datasets, and exhibit doubtful effectiveness when applied to medical image datasets due to challenges such as intra-class variation and inter-class similarity. In this paper, we propose a novel coreset selection strategy termed as Evolution-aware VAriance (EVA), which captures the evolutionary process of model training through a dual-window approach and reflects the fluctuation of sample importance more precisely through variance measurement. Extensive experiments on medical image datasets demonstrate the effectiveness of our strategy over previous SOTA methods, especially at high compression rates. EVA achieves 98.27% accuracy with only 10% training data, compared to 97.20% for the full training set. None of the compared baseline methods can exceed Random at 5% selection rate, while EVA outperforms Random by 5.61%, showcasing its potential for efficient medical image analysis.
comment: Accepted by ACM Multimedia 2024 (oral), see: https://openreview.net/forum?id=m1qrB9KSYD
♻ ☆ Biometrics and Behavior Analysis for Detecting Distractions in e-Learning
In this article, we explore computer vision approaches to detect abnormal head pose during e-learning sessions and we introduce a study on the effects of mobile phone usage during these sessions. We utilize behavioral data collected from 120 learners monitored while participating in a MOOC learning sessions. Our study focuses on the influence of phone-usage events on behavior and physiological responses, specifically attention, heart rate, and meditation, before, during, and after phone usage. Additionally, we propose an approach for estimating head pose events using images taken by the webcam during the MOOC learning sessions to detect phone-usage events. Our hypothesis suggests that head posture undergoes significant changes when learners interact with a mobile phone, contrasting with the typical behavior seen when learners face a computer during e-learning sessions. We propose an approach designed to detect deviations in head posture from the average observed during a learner's session, operating as a semi-supervised method. This system flags events indicating alterations in head posture for subsequent human review and selection of mobile phone usage occurrences with a sensitivity over 90%.
comment: Published in IEEE Intl. Symposium on Computers in Education (SIIE) 2024
♻ ☆ VAAD: Visual Attention Analysis Dashboard applied to e-Learning
In this paper, we present an approach in the Multimodal Learning Analytics field. Within this approach, we have developed a tool to visualize and analyze eye movement data collected during learning sessions in online courses. The tool is named VAAD, an acronym for Visual Attention Analysis Dashboard. These eye movement data have been gathered using an eye-tracker and subsequently processed and visualized for interpretation. The purpose of the tool is to conduct a descriptive analysis of the data by facilitating its visualization, enabling the identification of differences and learning patterns among various learner populations. Additionally, it integrates a predictive module capable of anticipating learner activities during a learning session. Consequently, VAAD holds the potential to offer valuable insights into online learning behaviors from both descriptive and predictive perspectives.
comment: Published in IEEE Intl. Symposium on Computers in Education (SIIE) 2024
♻ ☆ Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation
Medical image segmentation plays a crucial role in computer-aided diagnosis. However, existing methods heavily rely on fully supervised training, which requires a large amount of labeled data with time-consuming pixel-wise annotations. Moreover, accurately segmenting lesions poses challenges due to variations in shape, size, and location. To address these issues, we propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image Segmentation (DEC-Seg). First, we propose a Cross-level Feature Aggregation (CFA) module that integrates cross-level adjacent layers to enhance the feature representation ability across different resolutions. To address scale variation, we present a scale-enhanced consistency constraint, which ensures consistency in the segmentation maps generated from the same input image at different scales. This constraint helps handle variations in lesion sizes and improves the robustness of the model. Furthermore, we propose a cross-generative consistency scheme, in which the original and perturbed images can be reconstructed using cross-segmentation maps. This consistency constraint allows us to mine effective feature representations and boost the segmentation performance. To further exploit the scale information, we propose a Dual-scale Complementary Fusion (DCF) module that integrates features from two scale-specific decoders operating at different scales to help produce more accurate segmentation maps. Extensive experimental results on multiple medical segmentation tasks (polyp, skin lesion, and brain glioma) demonstrate the effectiveness of our DEC-Seg against other state-of-the-art semi-supervised segmentation approaches. The implementation code will be released at https://github.com/taozh2017/DECSeg.
comment: 12 pages 10 figures
♻ ☆ Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Existing pedestrian attribute recognition (PAR) algorithms adopt pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to the insufficient employment of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively and feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm only adjusts 0.75% learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance on both standard and zero-shot settings for PAR, including RAPv1, RAPv2, WIDER, PA100K, and PETA-ZS, RAP-ZS datasets. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenPAR.
comment: Accepted by IEEE TCSVT 2024, Camera Ready Version
♻ ☆ Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities
Imaging techniques such as Chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection for a wide variety of medical pulmonary and ophthalmic conditions respectively. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families each with pretraining and random initialization. Our finding showed that the use of pretrained models as fixed feature extractors yields poor performance irrespective of the datasets. Contrary, histopathology microscopy whole slide images have better performance. It is also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that the improvements in ImageNet are not parallel to the medical imaging tasks. Within a medical domain, the performance of the network architectures varies within model families with shifts in datasets. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.
comment: 15 pages, 3 figures, 4 tables
♻ ☆ Instant Adversarial Purification with Adversarial Consistency Distillation
Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve defense success rate of 74.19\% on ImageNet, only requiring 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that the GAND does not need a Full Fine Tune (FFT); PEFT, e.g., LoRA is sufficient.
♻ ☆ Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation ICPR 2024
In this work, we study the task of ``visually'' translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the source scene text, such as font, size, and background. There are several challenges associated with this task, such as translation with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements to design a variant of the baseline to obtain performance improvements. (iv) Currently, the existing related literature lacks any comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate presented baselines for translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baseline only partially addresses challenges posed by visual translation tasks. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.
comment: Accepted at ICPR 2024, Project Website: https://vl2g.github.io/projects/visTrans/
♻ ☆ An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models ECCV 2024
In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45 reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV are highly customizable and pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical values for deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
comment: Accepted to ECCV 2024 (Oral), code is released at https://github.com/pkunlp-icler/FastV,
♻ ☆ Video Diffusion Models are Strong Video Inpainter
Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and time consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. This proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.
♻ ☆ A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse
Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raises concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has extended these techniques as a benign metric to prevent the underlying misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA) based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on the white-box information of target models to get rid of the implicit reliance on model-specific knowledge. By accessing merely a small amount of LDM parameters, in specific merely the VAE encoder of LDMs, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM. Our method outperforms existing techniques, offering a more robust and generalizable solution that is helpful in alleviating the socio-technical challenges posed by the rapidly evolving landscape of generative AI.
comment: 21 pages, 7 figures, 10 tables
♻ ☆ Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams. Most event-based methods project events into dense frames and process them using conventional vision models, resulting in high computational complexity. A recent trend is to develop point-based networks that achieve efficient event processing by learning sparse representations. However, existing works may lack robust local information aggregators and effective feature interaction operations, thus limiting their modeling capabilities. To this end, we propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that consists of two well-designed components, including the Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and the Voxel Self-Attention Layer (VSAL) for global feature interaction. Enabling the network to incorporate a long-range temporal structure, we introduce a segment modeling strategy (S$^{2}$TM) to learn motion patterns from a sequence of segmented voxel sets. The proposed model is evaluated on two recognition tasks, including object classification and action recognition. To provide a convincing model evaluation, we present a new event-based action recognition dataset (NeuroHAR) recorded in challenging scenarios. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
comment: Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
♻ ☆ MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms ACL 2024
Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, making it challenging for machines to comprehend the information or emotions associated with interactions in online spaces. Multimodal Large Language Models (MLLMs) have emerged as a promising solution to these challenges, yet they struggle to accurately interpret human emotions and complex content such as misinformation. This paper introduces MM-Soc, a comprehensive benchmark designed to evaluate MLLMs' understanding of multimodal social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset, targeting a range of tasks from misinformation detection, hate speech detection, and social context generation. Through our exhaustive evaluation on ten size-variants of four open-source MLLMs, we have identified significant performance disparities, highlighting the need for advancements in models' social understanding capabilities. Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks. However, MLLMs demonstrate performance improvements post fine-tuning, suggesting potential pathways for improvement. Our code and data are available at https://github.com/claws-lab/MMSoc.git.
comment: In Proceedings of ACL 2024
♻ ☆ Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance
Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness through training models from scratch, the limited multi-modal data hinders these methods from reaching optimality. In this paper, we propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Despite serving as a recent vision fundamental model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop \underline{SAM} with se\underline{m}antic f\underline{e}ature fu\underline{s}ion guidanc\underline{e} (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt SAM to multi-modal SOD tasks. However, it is difficult for SAM trained on single-modal data to directly mine the complementary benefits of multi-modal inputs and comprehensively utilize them to achieve accurate saliency prediction. To address these issues, we first design a multi-modal complementary fusion module to extract robust multi-modal semantic features by integrating information from visible and thermal or depth image pairs. Then, we feed the extracted multi-modal semantic features into both the SAM image encoder and mask decoder for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks show the effectiveness of the proposed framework. The code will be available at \url{https://github.com/Angknpng/Sammese}.
comment: 10 pages, 9 figures
♻ ☆ S-NeRF++: Autonomous Driving Simulation via Neural Reconstruction and Generation
Autonomous driving simulation system plays a crucial role in enhancing self-driving data and simulating complex and rare traffic scenarios, ensuring navigation safety. However, traditional simulation systems, which often heavily rely on manual modeling and 2D image editing, struggled with scaling to extensive scenes and generating realistic simulation data. In this study, we present S-NeRF++, an innovative autonomous driving simulation system based on neural reconstruction. Trained on widely-used self-driving datasets such as nuScenes and Waymo, S-NeRF++ can generate a large number of realistic street scenes and foreground objects with high rendering quality as well as offering considerable flexibility in manipulation and simulation. Specifically, S-NeRF++ is an enhanced neural radiance field for synthesizing large-scale scenes and moving vehicles, with improved scene parameterization and camera pose learning. The system effectively utilizes noisy and sparse LiDAR data to refine training and address depth outliers, ensuring high-quality reconstruction and novel-view rendering. It also provides a diverse foreground asset bank by reconstructing and generating different foreground vehicles to support comprehensive scenario creation.Moreover, we have developed an advanced foreground-background fusion pipeline that skillfully integrates illumination and shadow effects, further enhancing the realism of our simulations. With the high-quality simulated data provided by our S-NeRF++, we found the perception methods enjoy performance boosts on several autonomous driving downstream tasks, further demonstrating our proposed simulator's effectiveness.
♻ ☆ NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches
Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life. However self-reporting methods such as food diaries suffer from substantial bias. Other conventional dietary assessment techniques and emerging alternative approaches such as mobile applications incur high time costs and may necessitate trained personnel. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities and food annotations hinders the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). Additionally, we collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism. Leveraging these novel datasets, we develop and benchmark NutritionVerse, an empirical study of various dietary intake estimation approaches, including indirect segmentation-based and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to provide insights into the fusion of synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth, NutritionVerse-Real) on https://www.kaggle.com/nutritionverse/datasets as part of an open initiative to accelerate machine learning for dietary sensing.
comment: Corrections made to Tables 6, 7, and 8, and corrections made to Experiments Part C. Additional clarification made in Section 4
♻ ☆ DarkGS: Learning Neural Illumination and 3D Gaussians Relighting for Robotic Exploration in the Dark
Humans have the remarkable ability to construct consistent mental models of an environment, even under limited or varying levels of illumination. We wish to endow robots with this same capability. In this paper, we tackle the challenge of constructing a photorealistic scene representation under poorly illuminated conditions and with a moving light source. We approach the task of modeling illumination as a learning problem, and utilize the developed illumination model to aid in scene reconstruction. We introduce an innovative framework that uses a data-driven approach, Neural Light Simulators (NeLiS), to model and calibrate the camera-light system. Furthermore, we present DarkGS, a method that applies NeLiS to create a relightable 3D Gaussian scene model capable of real-time, photorealistic rendering from novel viewpoints. We show the applicability and robustness of our proposed simulator and system in a variety of real-world environments.
comment: 8 pages, 10 figures
♻ ☆ Towards Non-invasive and Personalized Management of Breast Cancer Patients from Multiparametric MRI via A Large Mixture-of-Modality-Experts Model
Breast magnetic resonance imaging (MRI) is the imaging technique with the highest sensitivity for detecting breast cancer and is routinely used for women at high risk. Despite the comprehensive multiparametric protocol of breast MRI, existing artificial intelligence-based studies predominantly rely on single sequences and have limited validation. Here we report a large mixture-of-modality-experts model (MOME) that integrates multiparametric MRI information within a unified structure, offering a noninvasive method for personalized breast cancer management. We have curated the largest multiparametric breast MRI dataset, involving 5,205 patients from three hospitals in the north, southeast, and southwest of China, for the development and extensive evaluation of our model. MOME demonstrated accurate and robust identification of breast cancer. It achieved comparable performance for malignancy recognition to that of four senior radiologists and significantly outperformed a junior radiologist, with 0.913 AUROC, 0.948 AUPRC, 0.905 F1 score, and 0.723 MCC. Our findings suggest that MOME could reduce the need for biopsies in BI-RADS 4 patients with a ratio of 7.3%, classify triple-negative breast cancer with an AUROC of 0.709, and predict pathological complete response to neoadjuvant chemotherapy with an AUROC of 0.694. The model further supports scalable and interpretable inference, adapting to missing modalities and providing decision explanations by highlighting lesions and measuring modality contributions. MOME exemplifies a discriminative, robust, scalable, and interpretable multimodal model, paving the way for noninvasive, personalized management of breast cancer patients based on multiparametric breast imaging data.
comment: 27 pages, 8 figures, 10 tables
Information Retrieval
☆ Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets ECCV 2024
Temporal video alignment aims to synchronize the key events like object interactions or action phase transitions in two videos. Such methods could benefit various video editing, processing, and understanding tasks. However, existing approaches operate under the restrictive assumption that a suitable video pair for alignment is given, significantly limiting their broader applicability. To address this, we re-pose temporal alignment as a search problem and introduce the task of Alignable Video Retrieval (AVR). Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query. To achieve this, we make three key contributions: 1) we introduce DRAQ, a video alignability indicator to identify and re-rank the best alignable video from a set of candidates; 2) we propose an effective and generalizable frame-level video feature design to improve the alignment performance of several off-the-shelf feature representations, and 3) we propose a novel benchmark and evaluation protocol for AVR using cycle-consistency metrics. Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach in identifying alignable video pairs from diverse datasets. Project Page: https://daveishan.github.io/avr-webpage/.
comment: ECCV 2024 Oral
☆ Know When to Fuse: Investigating Non-English Hybrid Retrieval in the Legal Domain
Hybrid search has emerged as an effective strategy to offset the limitations of different matching paradigms, especially in out-of-domain contexts where notable improvements in retrieval quality have been observed. However, existing research predominantly focuses on a limited set of retrieval methods, evaluated in pairs on domain-general datasets exclusively in English. In this work, we study the efficacy of hybrid search across a variety of prominent retrieval models within the unexplored field of law in the French language, assessing both zero-shot and in-domain scenarios. Our findings reveal that in a zero-shot context, fusing different domain-general models consistently enhances performance compared to using a standalone model, regardless of the fusion method. Surprisingly, when models are trained in-domain, we find that fusion generally diminishes performance relative to using the best single system, unless fusing scores with carefully tuned weights. These novel insights, among others, expand the applicability of prior findings across a new field and language, and contribute to a deeper understanding of hybrid search in non-English specialized domains.
comment: Under review
☆ SSD4Rec: A Structured State Space Duality Model for Efficient Sequential Recommendation
Sequential recommendation methods are crucial in modern recommender systems for their remarkable capability to understand a user's changing interests based on past interactions. However, a significant challenge faced by current methods (e.g., RNN- or Transformer-based models) is to effectively and efficiently capture users' preferences by modeling long behavior sequences, which impedes their various applications like short video platforms where user interactions are numerous. Recently, an emerging architecture named Mamba, built on state space models (SSM) with efficient hardware-aware designs, has showcased the tremendous potential for sequence modeling, presenting a compelling avenue for addressing the challenge effectively. Inspired by this, we propose a novel generic and efficient sequential recommendation backbone, SSD4Rec, which explores the seamless adaptation of Mamba for sequential recommendations. Specifically, SSD4Rec marks the variable- and long-length item sequences with sequence registers and processes the item representations with bidirectional Structured State Space Duality (SSD) blocks. This not only allows for hardware-aware matrix multiplication but also empowers outstanding capabilities in variable-length and long-range sequence modeling. Extensive evaluations on four benchmark datasets demonstrate that the proposed model achieves state-of-the-art performance while maintaining near-linear scalability with user sequence length. Our code is publicly available at https://github.com/ZhangYifeng1995/SSD4Rec.
☆ Real World Conversational Entity Linking Requires More Than Zeroshots
Entity linking (EL) in conversations faces notable challenges in practical applications, primarily due to the scarcity of entity-annotated conversational datasets and sparse knowledge bases (KB) containing domain-specific, long-tail entities. We designed targeted evaluation scenarios to measure the efficacy of EL models under resource constraints. Our evaluation employs two KBs: Fandom, exemplifying real-world EL complexities, and the widely used Wikipedia. First, we assess EL models' ability to generalize to a new unfamiliar KB using Fandom and a novel zero-shot conversational entity linking dataset that we curated based on Reddit discussions on Fandom entities. We then evaluate the adaptability of EL models to conversational settings without prior training. Our results indicate that current zero-shot EL models falter when introduced to new, domain-specific KBs without prior training, significantly dropping in performance. Our findings reveal that previous evaluation approaches fall short of capturing real-world complexities for zero-shot EL, highlighting the necessity for new approaches to design and assess conversational EL models to adapt to limited resources. The evaluation setup and the dataset proposed in this research are made publicly available.
☆ LLM-PQA: LLM-enhanced Prediction Query Answering CIKM 2024
The advent of Large Language Models (LLMs) provides an opportunity to change the way queries are processed, moving beyond the constraints of conventional SQL-based database systems. However, using an LLM to answer a prediction query is still challenging, since an external ML model has to be employed and inference has to be performed in order to provide an answer. This paper introduces LLM-PQA, a novel tool that addresses prediction queries formulated in natural language. LLM-PQA is the first to combine the capabilities of LLMs and retrieval-augmented mechanism for the needs of prediction queries by integrating data lakes and model zoos. This integration provides users with access to a vast spectrum of heterogeneous data and diverse ML models, facilitating dynamic prediction query answering. In addition, LLM-PQA can dynamically train models on demand, based on specific query requirements, ensuring reliable and relevant results even when no pre-trained model in a model zoo, available for the task.
comment: This paper is accepted as a demo at CIKM 2024
☆ Evidential Transformers for Improved Image Retrieval ECCV 2024
We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval. In this paper, we make several contributions to content-based image retrieval (CBIR). We incorporate probabilistic methods into image retrieval, achieving robust and reliable results, with evidential classification surpassing traditional training based on multiclass classification as a baseline for deep metric learning. Furthermore, we improve the state-of-the-art retrieval results on several datasets by leveraging the Global Context Vision Transformer (GC ViT) architecture. Our experimental results consistently demonstrate the reliability of our approach, setting a new benchmark in CBIR in all test settings on the Stanford Online Products (SOP) and CUB-200-2011 datasets.
comment: 6 pages, 6 figures, To be presented at the 3rd Workshop on Uncertainty Quantification for Computer Vision, at the ECCV 2024 conference in Milan, Italy
☆ Improved Diversity-Promoting Collaborative Metric Learning for Recommendation
Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems (RS), closing the gap between metric learning and collaborative filtering. Following the convention of RS, existing practices exploit unique user representation in their model design. This paper focuses on a challenging scenario where a user has multiple categories of interests. Under this setting, the unique user representation might induce preference bias, especially when the item category distribution is imbalanced. To address this issue, we propose a novel method called \textit{Diversity-Promoting Collaborative Metric Learning} (DPCML), with the hope of considering the commonly ignored minority interest of the user. The key idea behind DPCML is to introduce a set of multiple representations for each user in the system where users' preference toward an item is aggregated by taking the minimum item-user distance among their embedding set. Specifically, we instantiate two effective assignment strategies to explore a proper quantity of vectors for each user. Meanwhile, a \textit{Diversity Control Regularization Scheme} (DCRS) is developed to accommodate the multi-vector representation strategy better. Theoretically, we show that DPCML could induce a smaller generalization error than traditional CML. Furthermore, we notice that CML-based approaches usually require \textit{negative sampling} to reduce the heavy computational burden caused by the pairwise objective therein. In this paper, we reveal the fundamental limitation of the widely adopted hard-aware sampling from the One-Way Partial AUC (OPAUC) perspective and then develop an effective sampling alternative for the CML-based paradigm. Finally, comprehensive experiments over a range of benchmark datasets speak to the efficacy of DPCML. Code are available at \url{https://github.com/statusrank/LibCML}.
comment: arXiv admin note: text overlap with arXiv:2209.15292
☆ Towards Investigating Biases in Spoken Conversational Search
Voice-based systems like Amazon Alexa, Google Assistant, and Apple Siri, along with the growing popularity of OpenAI's ChatGPT and Microsoft's Copilot, serve diverse populations, including visually impaired and low-literacy communities. This reflects a shift in user expectations from traditional search to more interactive question-answering models. However, presenting information effectively in voice-only channels remains challenging due to their linear nature. This limitation can impact the presentation of complex queries involving controversial topics with multiple perspectives. Failing to present diverse viewpoints may perpetuate or introduce biases and affect user attitudes. Balancing information load and addressing biases is crucial in designing a fair and effective voice-based system. To address this, we (i) review how biases and user attitude changes have been studied in screen-based web search, (ii) address challenges in studying these changes in voice-based settings like SCS, (iii) outline research questions, and (iv) propose an experimental setup with variables, data, and instruments to explore biases in a voice-based setting like Spoken Conversational Search.
comment: Accepted Late-Breaking Results at ACM ICMI Companion 2024
♻ ☆ Manipulating Large Language Models to Increase Product Visibility
Large language models (LLMs) are increasingly being integrated into search engines to provide natural language responses tailored to user queries. Customers and end-users are also becoming more dependent on these models for quick and easy purchase decisions. In this work, we investigate whether recommendations from LLMs can be manipulated to enhance a product's visibility. We demonstrate that adding a strategic text sequence (STS) -- a carefully crafted message -- to a product's information page can significantly increase its likelihood of being listed as the LLM's top recommendation. To understand the impact of STS, we use a catalog of fictitious coffee machines and analyze its effect on two target products: one that seldom appears in the LLM's recommendations and another that usually ranks second. We observe that the strategic text sequence significantly enhances the visibility of both products by increasing their chances of appearing as the top recommendation. This ability to manipulate LLM-generated search responses provides vendors with a considerable competitive advantage and has the potential to disrupt fair market competition. Just as search engine optimization (SEO) revolutionized how webpages are customized to rank higher in search engine results, influencing LLM recommendations could profoundly impact content optimization for AI-driven search services. Code for our experiments is available at https://github.com/aounon/llm-rank-optimizer.
♻ ☆ A multi-language toolkit for supporting automated checking of research outputs
This article presents the automatic checking of research outputs package acro, which assists researchers and data governance teams by automatically applying best-practice principles-based statistical disclosure control (SDC) techniques on-the-fly as researchers conduct their analyses. acro distinguishes between: research output that is safe to publish; output that requires further analysis; and output that cannot be published because it creates substantial risk of disclosing private data. This is achieved through the use of a lightweight Python wrapper that sits over well-known analysis tools that produce outputs such as tables, plots, and statistical models. This adds functionality to (i) identify potentially disclosive outputs against a range of commonly used disclosure tests; (ii) apply disclosure mitigation strategies where required; (iii) report reasons for applying SDC; and (iv) produce simple summary documents trusted research environment staff can use to streamline their workflow. The major analytical programming languages used by researchers are supported: Python, R, and Stata. The acro code and documentation are available under an MIT license at https://github.com/AI-SDC/ACRO
♻ ☆ VM-Rec: A Variational Mapping Approach for Cold-start User Recommendation
The cold-start problem is a common challenge for most recommender systems. The practical application of most cold-start methods is hindered by the deficiency in auxiliary content information for users. Moreover, most methods necessitate simultaneous updates to the extensive parameters of recommender models, leading to significant training costs, particularly in large-scale industrial scenarios. We observe that the model can generate expressive embeddings for warm users with relatively more interactions. Initially, these users were cold-start users, and after transitioning to warm users, they exhibit clustering patterns in their embeddings with consistent initial interactions. Based on this motivation, we propose a Variational Mapping approach for cold-start user Recommendation (VM-Rec), mapping from few initial interactions to expressive embeddings for cold-start users. Specifically, we encode the initial interactions into a latent representation, where each dimension disentangledly signifies the degree of association with each warm user. Subsequently, we utilize this latent representation as the parameters for the mapping function, mapping (decoding) it into an expressive embedding, which can be integrated into a pre-trained recommender model directly. Our method is evaluated on three datasets using the same base model, demonstrating superior performance compared to other popular cold-start methods.
♻ ☆ A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning KDD
Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ability. We refined the text chunks and tables in web pages, added attribute predictors to reduce hallucinations, conducted LLM Knowledge Extractor and Knowledge Graph Extractor, and finally built a reasoning strategy with all the references. We evaluated our system on the CRAG dataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and online evaluations demonstrate that our system significantly enhances complex reasoning capabilities. In local evaluations, we have significantly improved accuracy and reduced error rates compared to the baseline model, achieving a notable increase in scores. In the meanwhile, we have attained outstanding results in online assessments, demonstrating the performance and generalization capabilities of the proposed system. The source code for our system is released in \url{https://gitlab.aicrowd.com/shizueyy/crag-new}.
comment: Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024
♻ ☆ PEPT: Expert Finding Meets Personalized Pre-training
Finding experts is essential in Community Question Answering (CQA) platforms as it enables the effective routing of questions to potential users who can provide relevant answers. The key is to personalized learning expert representations based on their historical answered questions, and accurately matching them with target questions. There have been some preliminary works exploring the usability of PLMs in expert finding, such as pre-training expert or question representations. However, these models usually learn pure text representations of experts from histories, disregarding personalized and fine-grained expert modeling. For alleviating this, we present a personalized pre-training and fine-tuning paradigm, which could effectively learn expert interest and expertise simultaneously. Specifically, in our pre-training framework, we integrate historical answered questions of one expert with one target question, and regard it as a candidate aware expert-level input unit. Then, we fuse expert IDs into the pre-training for guiding the model to model personalized expert representations, which can help capture the unique characteristics and expertise of each individual expert. Additionally, in our pre-training task, we design: 1) a question-level masked language model task to learn the relatedness between histories, enabling the modeling of question-level expert interest; 2) a vote-oriented task to capture question-level expert expertise by predicting the vote score the expert would receive. Through our pre-training framework and tasks, our approach could holistically learn expert representations including interests and expertise. Our method has been extensively evaluated on six real-world CQA datasets, and the experimental results consistently demonstrate the superiority of our approach over competitive baseline methods.
Machine Learning
♻ ☆ Advanced Predictive Modeling for Enhanced Mortality Prediction in ICU Stroke Patients Using Clinical Data
Background: Stroke is second-leading cause of disability and death among adults. Approximately 17 million people suffer from a stroke annually, with about 85% being ischemic strokes. Predicting mortality of ischemic stroke patients in intensive care unit (ICU) is crucial for optimizing treatment strategies, allocating resources, and improving survival rates. Methods: We acquired data on ICU ischemic stroke patients from MIMIC-IV database, including diagnoses, vital signs, laboratory tests, medications, procedures, treatments, and clinical notes. Stroke patients were randomly divided into training (70%, n=2441), test (15%, n=523), and validation (15%, n=523) sets. To address data imbalances, we applied Synthetic Minority Over-sampling Technique (SMOTE). We selected 30 features for model development, significantly reducing feature number from 1095 used in the best study. We developed a deep learning model to assess mortality risk and implemented several baseline machine learning models for comparison. Results: XGB-DL model, combining XGBoost for feature selection and deep learning, effectively minimized false positives. Model's AUROC improved from 0.865 (95% CI: 0.821 - 0.905) on first day to 0.903 (95% CI: 0.868 - 0.936) by fourth day using data from 3,646 ICU mortality patients in the MIMIC-IV database with 0.945 AUROC (95% CI: 0.944 - 0.947) during training. Although other ML models also performed well in terms of AUROC, we chose Deep Learning for its higher specificity. Conclusions: Through enhanced feature selection and data cleaning, proposed model demonstrates a 13% AUROC improvement compared to existing models while reducing feature number from 1095 in previous studies to 30.
♻ ☆ AlphaFold Meets Flow Matching for Generating Protein Ensembles ICML 2024
The biological functions of proteins often depend on dynamic structural ensembles. In this work, we develop a flow-based generative modeling approach for learning and sampling the conformational landscapes of proteins. We repurpose highly accurate single-state predictors such as AlphaFold and ESMFold and fine-tune them under a custom flow matching framework to obtain sequence-conditoned generative models of protein structure called AlphaFlow and ESMFlow. When trained and evaluated on the PDB, our method provides a superior combination of precision and diversity compared to AlphaFold with MSA subsampling. When further trained on ensembles from all-atom MD, our method accurately captures conformational flexibility, positional distributions, and higher-order ensemble observables for unseen proteins. Moreover, our method can diversify a static PDB structure with faster wall-clock convergence to certain equilibrium properties than replicate MD trajectories, demonstrating its potential as a proxy for expensive physics-based simulations. Code is available at https://github.com/bjing2016/alphaflow.
comment: ICML 2024
♻ ☆ Automatic Differentiation is Essential in Training Neural Networks for Solving Differential Equations
Neural network-based approaches have recently shown significant promise in solving partial differential equations (PDEs) in science and engineering, especially in scenarios featuring complex domains or incorporation of empirical data. One advantage of the neural network methods for PDEs lies in its automatic differentiation (AD), which necessitates only the sample points themselves, unlike traditional finite difference (FD) approximations that require nearby local points to compute derivatives. In this paper, we quantitatively demonstrate the advantage of AD in training neural networks. The concept of truncated entropy is introduced to characterize the training property. Specifically, through comprehensive experimental and theoretical analyses conducted on random feature models and two-layer neural networks, we discover that the defined truncated entropy serves as a reliable metric for quantifying the residual loss of random feature models and the training speed of neural networks for both AD and FD methods. Our experimental and theoretical analyses demonstrate that, from a training perspective, AD outperforms FD in solving PDEs.
♻ ☆ On the limits of neural network explainability via descrambling
We characterize the exact solutions to neural network descrambling--a mathematical model for explaining the fully connected layers of trained neural networks (NNs). By reformulating the problem to the minimization of the Brockett function arising in graph matching and complexity theory we show that the principal components of the hidden layer preactivations can be characterized as the optimal explainers or descramblers for the layer weights, leading to descrambled weight matrices. We show that in typical deep learning contexts these descramblers take diverse and interesting forms including (1) matching largest principal components with the lowest frequency modes of the Fourier basis for isotropic hidden data, (2) discovering the semantic development in two-layer linear NNs for signal recovery problems, and (3) explaining CNNs by optimally permuting the neurons. Our numerical experiments indicate that the eigendecompositions of the hidden layer data--now understood as the descramblers--can also reveal the layer's underlying transformation. These results illustrate that the SVD is more directly related to the explainability of NNs than previously thought and offers a promising avenue for discovering interpretable motifs for the hidden action of NNs, especially in contexts of operator learning or physics-informed NNs, where the input/output data has limited human readability.
MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation
Model merging has emerged as an effective approach to combine multiple single-task models, fine-tuned from the same pre-trained model, into a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the objectives of different tasks can lead to trade-offs during model merging. In real-world applications, a set of solutions with various trade-offs can be more informative, helping practitioners make decisions based on diverse preferences. In this paper, we introduce a novel low-compute algorithm, Model Merging with Amortized Pareto Front (MAP). MAP identifies a Pareto set of scaling coefficients for merging multiple models to reflect the trade-offs. The core component of MAP is approximating the evaluation metrics of the various tasks using a quadratic approximation surrogate model derived from a pre-selected set of scaling coefficients, enabling amortized inference. Experimental results on vision and natural language processing tasks show that MAP can accurately identify the Pareto front. To further reduce the required computation of MAP, we propose (1) a Bayesian adaptive sampling algorithm and (2) a nested merging scheme with multiple stages.
♻ ☆ RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance
Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constraint devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate the solution of the coherence problem through the proposed approach by reporting substantive experiments to demonstrate our approach's effectiveness in compact model size and excellent generation quality.
♻ ☆ Uplift Modeling Under Limited Supervision
Estimating causal effects in e-commerce tends to involve costly treatment assignments which can be impractical in large-scale settings. Leveraging machine learning to predict such treatment effects without actual intervention is a standard practice to diminish the risk. However, existing methods for treatment effect prediction tend to rely on training sets of substantial size, which are built from real experiments and are thus inherently risky to create. In this work we propose a graph neural network to diminish the required training set size, relying on graphs that are common in e-commerce data. Specifically, we view the problem as node regression with a restricted number of labeled instances, develop a two-model neural architecture akin to previous causal effect estimators, and test varying message-passing layers for encoding. Furthermore, as an extra step, we combine the model with an acquisition function to guide the creation of the training set in settings with extremely low experimental budget. The framework is flexible since each step can be used separately with other models or treatment policies. The experiments on real large-scale networks indicate a clear advantage of our methodology over the state of the art, which in many cases performs close to random, underlining the need for models that can generalize with limited supervision to reduce experimental risks.
♻ ☆ Globally Stable Neural Imitation Policies
Imitation learning presents an effective approach to alleviate the resource-intensive and time-consuming nature of policy learning from scratch in the solution space. Even though the resulting policy can mimic expert demonstrations reliably, it often lacks predictability in unexplored regions of the state-space, giving rise to significant safety concerns in the face of perturbations. To address these challenges, we introduce the Stable Neural Dynamical System (SNDS), an imitation learning regime which produces a policy with formal stability guarantees. We deploy a neural policy architecture that facilitates the representation of stability based on Lyapunov theorem, and jointly train the policy and its corresponding Lyapunov candidate to ensure global stability. We validate our approach by conducting extensive experiments in simulation and successfully deploying the trained policies on a real-world manipulator arm. The experimental results demonstrate that our method overcomes the instability, accuracy, and computational intensity problems associated with previous imitation learning methods, making our method a promising solution for stable policy learning in complex planning scenarios.
AI-Assisted Generation of Difficult Math Questions
Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage LLM metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline on skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$ - a dataset of higher-quality math questions, as evidenced by: (a) Lower performance of all models on MATH$^2$ than on MATH (b) Higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship observed between models' performance on the new dataset: the success rate on MATH$^2$ is the square on MATH, suggesting that successfully solving the question in MATH$^2$ requires a nontrivial combination of two distinct math skills.
♻ ☆ Online Detection of Anomalies in Temporal Knowledge Graphs with Interpretability SIGMOD 2025
Temporal knowledge graphs (TKGs) are valuable resources for capturing evolving relationships among entities, yet they are often plagued by noise, necessitating robust anomaly detection mechanisms. Existing dynamic graph anomaly detection approaches struggle to capture the rich semantics introduced by node and edge categories within TKGs, while TKG embedding methods lack interpretability, undermining the credibility of anomaly detection. Moreover, these methods falter in adapting to pattern changes and semantic drifts resulting from knowledge updates. To tackle these challenges, we introduce AnoT, an efficient TKG summarization method tailored for interpretable online anomaly detection in TKGs. AnoT begins by summarizing a TKG into a novel rule graph, enabling flexible inference of complex patterns in TKGs. When new knowledge emerges, AnoT maps it onto a node in the rule graph and traverses the rule graph recursively to derive the anomaly score of the knowledge. The traversal yields reachable nodes that furnish interpretable evidence for the validity or the anomalous of the new knowledge. Overall, AnoT embodies a detector-updater-monitor architecture, encompassing a detector for offline TKG summarization and online scoring, an updater for real-time rule graph updates based on emerging knowledge, and a monitor for estimating the approximation error of the rule graph. Experimental results on four real-world datasets demonstrate that AnoT surpasses existing methods significantly in terms of accuracy and interoperability. All of the raw datasets and the implementation of AnoT are provided in https://github.com/zjs123/ANoT.
comment: 26 pages, 10 figures. Accepted by SIGMOD 2025
♻ ☆ NeuFair: Neural Network Fairness Repair with Dropout ISSTA 2024
This paper investigates neuron dropout as a post-processing bias mitigation for deep neural networks (DNNs). Neural-driven software solutions are increasingly applied in socially critical domains with significant fairness implications. While neural networks are exceptionally good at finding statistical patterns from data, they may encode and amplify existing biases from the historical data. Existing bias mitigation algorithms often require modifying the input dataset or the learning algorithms. We posit that the prevalent dropout methods that prevent over-fitting during training by randomly dropping neurons may be an effective and less intrusive approach to improve the fairness of pre-trained DNNs. However, finding the ideal set of neurons to drop is a combinatorial problem. We propose NeuFair, a family of post-processing randomized algorithms that mitigate unfairness in pre-trained DNNs via dropouts during inference after training. Our randomized search is guided by an objective to minimize discrimination while maintaining the model's utility. We show that our design of randomized algorithms is effective and efficient in improving fairness (up to 69%) with minimal or no model performance degradation. We provide intuitive explanations of these phenomena and carefully examine the influence of various hyperparameters of search algorithms on the results. Finally, we empirically and conceptually compare NeuFair to different state-of-the-art bias mitigators.
comment: Paper accepted at ACM ISSTA 2024
♻ ☆ Privacy-Aware Document Visual Question Answering ICDAR 2024
Document Visual Question Answering (DocVQA) has quickly grown into a central task of document understanding. But despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees. In this work, we explore privacy in the domain of DocVQA for the first time, highlighting privacy issues in state of the art multi-modal LLM models used for DocVQA, and explore possible solutions. Specifically, we focus on invoice processing as a realistic document understanding scenario, and propose a large scale DocVQA dataset comprising invoice documents and associated questions and answers. We employ a federated learning scheme, that reflects the real-life distribution of documents in different businesses, and we explore the use case where the data of the invoice provider is the sensitive information to be protected. We demonstrate that non-private models tend to memorise, a behaviour that can lead to exposing private information. We then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through either or both of the two input modalities: vision (document image) or language (OCR tokens). Finally, we design attacks exploiting the memorisation effect of the model, and demonstrate their effectiveness in probing a representative DocVQA models.
comment: 35 pages, 12 figures, accepted for publication at the 18th International Conference on Document Analysis and Recognition, ICDAR 2024
♻ ☆ Exploring Bias and Prediction Metrics to Characterise the Fairness of Machine Learning for Equity-Centered Public Health Decision-Making: A Narrative Review
Background: The rapid advancement of Machine Learning (ML) represents novel opportunities to enhance public health research, surveillance, and decision-making. However, there is a lack of comprehensive understanding of algorithmic bias, systematic errors in predicted population health outcomes, resulting from the public health application of ML. The objective of this narrative review is to explore the types of bias generated by ML and quantitative metrics to assess these biases. Methods : We performed search on PubMed, MEDLINE, IEEE (Institute of Electrical and Electronics Engineers), ACM (Association for Computing Machinery) Digital Library, Science Direct, and Springer Nature. We used keywords to identify studies describing types of bias and metrics to measure these in the domain of ML and public and population health published in English between 2008 and 2023, inclusive. Results: A total of 72 articles met the inclusion criteria. Our review identified the commonly described types of bias and quantitative metrics to assess these biases from an equity perspective. Conclusion : The review will help formalize the evaluation framework for ML on public health from an equity perspective.
comment: under review
♻ ☆ Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? ECCV 2024
Foundation models have emerged as robust models with label efficiency in diverse domains. In medical imaging, these models contribute to the advancement of medical diagnoses due to the difficulty in obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines the bias in the Foundation model (RetFound) when it is applied to fine-tune the Brazilian Multilabel Ophthalmological Dataset (BRSET), which has a different population than the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the Foundation Model has the potential to reduce the gap between the maximum AUC and minimum AUC evaluations across gender and age groups. However, in a data-efficient generalization, the model increases the bias when the data amount decreases. These findings suggest that when deploying a Foundation Model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
comment: Preprint of paper to be presented at Fairness and Ethics Towards Transparent AI: Facing the Challenge through Model Debiasing (FAILED) during ECCV 2024
♻ ☆ Ancestral Reinforcement Learning: Unifying Zeroth-Order Optimization and Genetic Algorithms for Reinforcement Learning
Reinforcement Learning (RL) offers a fundamental framework for discovering optimal action strategies through interactions within unknown environments. Recent advancement have shown that the performance and applicability of RL can significantly be enhanced by exploiting a population of agents in various ways. Zeroth-Order Optimization (ZOO) leverages an agent population to estimate the gradient of the objective function, enabling robust policy refinement even in non-differentiable scenarios. As another application, Genetic Algorithms (GA) boosts the exploration of policy landscapes by mutational generation of policy diversity in an agent population and its refinement by selection. A natural question is whether we can have the best of two worlds that the agent population can have. In this work, we propose Ancestral Reinforcement Learning (ARL), which synergistically combines the robust gradient estimation of ZOO with the exploratory power of GA. The key idea in ARL is that each agent within a population infers gradient by exploiting the history of its ancestors, i.e., the ancestor population in the past, while maintaining the diversity of policies in the current population as in GA. We also theoretically reveal that the populational search in ARL implicitly induces the KL-regularization of the objective function, resulting in the enhanced exploration. Our results extend the applicability of populational algorithms for RL.
comment: 16pages, 3 figures
♻ ☆ Analysis of Failures and Risks in Deep Learning Model Converters: A Case Study in the ONNX Ecosystem ISSTA'24
Software engineers develop, fine-tune, and deploy deep learning (DL) models using a variety of development frameworks and runtime environments. DL model converters move models between frameworks and to runtime environments. Conversion errors compromise model quality and disrupt deployment. However, the failure characteristics of DL model converters are unknown, adding risk when using DL interoperability technologies. This paper analyzes failures in DL model converters. We survey software engineers about DL interoperability tools, use cases, and pain points (N=92). Then, we characterize failures in model converters associated with the main interoperability tool, ONNX (N=200 issues in PyTorch and TensorFlow). Finally, we formulate and test two hypotheses about structural causes for the failures we studied. We find that the node conversion stage of a model converter accounts for ~75% of the defects and 33% of reported failure are related to semantically incorrect models. The cause of semantically incorrect models is elusive, but models with behaviour inconsistencies share operator sequences. Our results motivate future research on making DL interoperability software simpler to maintain, extend, and validate. Research into behavioural tolerances and architectural coverage metrics could be fruitful.
comment: [ISSTA'24] Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) 2024
♻ ☆ HyperInterval: Hypernetwork approach to training weight interval regions in continual learning
Recently, a new Continual Learning (CL) paradigm was presented to control catastrophic forgetting, called Interval Continual Learning (InterContiNet), which relies on enforcing interval constraints on the neural network parameter space. Unfortunately, InterContiNet training is challenging due to the high dimensionality of the weight space, making intervals difficult to manage. To address this issue, we introduce \our{} \footnote{The source code is available at https://github.com/gmum/HyperInterval}, a technique that employs interval arithmetic within the embedding space and utilizes a hypernetwork to map these intervals to the target network parameter space. We train interval embeddings for consecutive tasks and train a hypernetwork to transform these embeddings into weights of the target network. An embedding for a given task is trained along with the hypernetwork, preserving the response of the target network for the previous task embeddings. Interval arithmetic works with a more manageable, lower-dimensional embedding space rather than directly preparing intervals in a high-dimensional weight space. Our model allows faster and more efficient training. Furthermore, \our{} maintains the guarantee of not forgetting. At the end of training, we can choose one universal embedding to produce a single network dedicated to all tasks. In such a framework, hypernetwork is used only for training and, finally, we can utilize one set of weights. \our{} obtains significantly better results than InterContiNet and gives SOTA results on several benchmarks.
♻ ☆ ResQuNNs:Towards Enabling Deep Learning in Quantum Convolution Neural Networks
In this paper, we present a novel framework for enhancing the performance of Quanvolutional Neural Networks (QuNNs) by introducing trainable quanvolutional layers and addressing the critical challenges associated with them. Traditional quanvolutional layers, although beneficial for feature extraction, have largely been static, offering limited adaptability. Unlike state-of-the-art, our research overcomes this limitation by enabling training within these layers, significantly increasing the flexibility and potential of QuNNs. However, the introduction of multiple trainable quanvolutional layers induces complexities in gradient-based optimization, primarily due to the difficulty in accessing gradients across these layers. To resolve this, we propose a novel architecture, Residual Quanvolutional Neural Networks (ResQuNNs), leveraging the concept of residual learning, which facilitates the flow of gradients by adding skip connections between layers. By inserting residual blocks between quanvolutional layers, we ensure enhanced gradient access throughout the network, leading to improved training performance. Moreover, we provide empirical evidence on the strategic placement of these residual blocks within QuNNs. Through extensive experimentation, we identify an efficient configuration of residual blocks, which enables gradients across all the layers in the network that eventually results in efficient training. Our findings suggest that the precise location of residual blocks plays a crucial role in maximizing the performance gains in QuNNs. Our results mark a substantial step forward in the evolution of quantum deep learning, offering new avenues for both theoretical development and practical quantum computing applications.
♻ ☆ Stabilizing Extreme Q-learning by Maclaurin Expansion
In offline reinforcement learning, in-sample learning methods have been widely used to prevent performance degradation caused by evaluating out-of-distribution actions from the dataset. Extreme Q-learning (XQL) employs a loss function based on the assumption that Bellman error follows a Gumbel distribution, enabling it to model the soft optimal value function in an in-sample manner. It has demonstrated strong performance in both offline and online reinforcement learning settings. However, issues remain, such as the instability caused by the exponential term in the loss function and the risk of the error distribution deviating from the Gumbel distribution. Therefore, we propose Maclaurin Expanded Extreme Q-learning to enhance stability. In this method, applying Maclaurin expansion to the loss function in XQL enhances stability against large errors. This approach involves adjusting the modeled value function between the value function under the behavior policy and the soft optimal value function, thus achieving a trade-off between stability and optimality depending on the order of expansion. It also enables adjustment of the error distribution assumption from a normal distribution to a Gumbel distribution. Our method significantly stabilizes learning in online RL tasks from DM Control, where XQL was previously unstable. Additionally, it improves performance in several offline RL tasks from D4RL.
comment: Accepted at RLC 2024: The first Reinforcement Learning Conference
♻ ☆ An Effective Information Theoretic Framework for Channel Pruning
Channel pruning is a promising method for accelerating and compressing convolutional neural networks. However, current pruning algorithms still remain unsolved problems that how to assign layer-wise pruning ratios properly and discard the least important channels with a convincing criterion. In this paper, we present a novel channel pruning approach via information theory and interpretability of neural networks. Specifically, we regard information entropy as the expected amount of information for convolutional layers. In addition, if we suppose a matrix as a system of linear equations, a higher-rank matrix represents there exist more solutions to it, which indicates more uncertainty. From the point of view of information theory, the rank can also describe the amount of information. In a neural network, considering the rank and entropy as two information indicators of convolutional layers, we propose a fusion function to reach a compromise of them, where the fusion results are defined as ``information concentration''. When pre-defining layer-wise pruning ratios, we employ the information concentration as a reference instead of heuristic and engineering tuning to provide a more interpretable solution. Moreover, we leverage Shapley values, which are a potent tool in the interpretability of neural networks, to evaluate the channel contributions and discard the least important channels for model compression while maintaining its performance. Extensive experiments demonstrate the effectiveness and promising performance of our method. For example, our method improves the accuracy by 0.21% when reducing 45.5% FLOPs and removing 40.3% parameters for ResNet-56 on CIFAR-10. Moreover, our method obtains loss in Top-1/Top-5 accuracies of 0.43%/0.11% by reducing 41.6% FLOPs and removing 35.0% parameters for ResNet-50 on ImageNet.
♻ ☆ Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model ACM MM 2024
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. Our code and weights are available at https://github.com/xhanxu/Mamba3D.
comment: ACM MM 2024. Code and weights are available at https://github.com/xhanxu/Mamba3D
♻ ☆ Identifying Weight-Variant Latent Causal Models
The task of causal representation learning aims to uncover latent higher-level causal representations that affect lower-level observations. Identifying true latent causal representations from observed data, while allowing instantaneous causal relations among latent variables, remains a challenge, however. To this end, we start from the analysis of three intrinsic properties in identifying latent space from observations: transitivity, permutation indeterminacy, and scaling indeterminacy. We find that transitivity acts as a key role in impeding the identifiability of latent causal representations. To address the unidentifiable issue due to transitivity, we introduce a novel identifiability condition where the underlying latent causal model satisfies a linear-Gaussian model, in which the causal coefficients and the distribution of Gaussian noise are modulated by an additional observed variable. Under some mild assumptions, we can show that the latent causal representations can be identified up to trivial permutation and scaling. Furthermore, based on this theoretical result, we propose a novel method, termed Structural caUsAl Variational autoEncoder, which directly learns latent causal representations and causal relationships among them, together with the mapping from the latent causal variables to the observed ones. We show that the proposed method learns the true parameters asymptotically. Experimental results on synthetic and real data demonstrate the identifiability and consistency results and the efficacy of the proposed method in learning latent causal representations.
♻ ☆ Deep Learning-based Target-To-User Association in Integrated Sensing and Communication Systems
In Integrated Sensing and Communication (ISAC) systems, matching the radar targets with communication user equipments (UEs) is functional to several communication tasks, such as proactive handover and beam prediction. In this paper, we consider a radar-assisted communication system where a base station (BS) is equipped with a multiple-input-multiple-output (MIMO) radar that has a double aim: (i) associate vehicular radar targets to vehicular equipments (VEs) in the communication beamspace and (ii) predict the beamforming vector for each VE from radar data. The proposed target-to-user (T2U) association consists of two stages. First, vehicular radar targets are detected from range-angle images, and, for each, a beamforming vector is estimated. Then, the inferred per-target beamforming vectors are matched with the ones utilized at the BS for communication to perform target-to-user (T2U) association. Joint multi-target detection and beam inference is obtained by modifying the you only look once (YOLO) model, which is trained over simulated range-angle radar images. Simulation results over different urban vehicular mobility scenarios show that the proposed T2U method provides a probability of correct association that increases with the size of the BS antenna array, highlighting the respective increase of the separability of the VEs in the beamspace. Moreover, we show that the modified YOLO architecture can effectively perform both beam prediction and radar target detection, with similar performance in mean average precision on the latter over different antenna array sizes.
♻ ☆ Barlow Twins Deep Neural Network for Advanced 1D Drug-Target Interaction Prediction
Accurate prediction of drug-target interactions is critical for advancing drug discovery. By reducing time and cost, machine learning and deep learning can accelerate this laborious discovery process. In a novel approach, BarlowDTI, we utilise the powerful Barlow Twins architecture for feature-extraction while considering the structure of the target protein. Our method achieves state-of-the-art predictive performance against multiple established benchmarks using only one-dimensional input. The use of gradient boosting machine as the underlying predictor ensures fast and efficient predictions without the need for substantial computational resources. We also investigate how the model reaches its decision based on individual training samples. By comparing co-crystal structures, we find that BarlowDTI effectively exploits catalytically active and stabilising residues, highlighting the model's ability to generalise from one-dimensional input data. In addition, we further benchmark new baselines against existing methods. Together, these innovations improve the efficiency and effectiveness of drug-target interaction predictions, providing robust tools for accelerating drug development and deepening the understanding of molecular interactions. Therefore, we provide an easy-to-use web interface that can be freely accessed at https://www.bio.nat.tum.de/oc2/barlowdti .
comment: Refined model architecture, additional results added
♻ ☆ TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition
One persistent challenge in Speech Emotion Recognition (SER) is the ubiquitous environmental noise, which frequently results in deteriorating SER performance in practice. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. Later, we utilize clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of enhanced speech during model training. Experimental results validate that the proposed TRNet substantially promotes the robustness of the proposed system in both matched and unmatched noisy environments, without compromising its performance in noise-free environments.
comment: 14 pages, 3 figures
♻ ☆ DiffLoad: Uncertainty Quantification in Electrical Load Forecasting with the Diffusion Model
Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where and to what extent the uncertainty is, thereby enhancing their confidence in the following decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also demonstrates the ability to separate the two types of uncertainties and be applicable to different levels of loads. The relevant code can be found at \url{https://anonymous.4open.science/r/DiffLoad-4714/}.
comment: Accepted by IEEE Transactions on Power Systems, 2024
♻ ☆ Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form
Designing a safe policy for uncertain environments is crucial in real-world control applications. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm capable of identifying a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional Lagrangian max-min formulation with policy gradient methods can become trapped in suboptimal solutions by encountering a sum of conflicting gradients from the objective and constraint functions during its inner minimization problem. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a binary search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\tilde{\mathcal{O}}(\varepsilon^{-4})$ policy evaluations.
♻ ☆ EuroPED-NN: Uncertainty aware surrogate model
This work successfully generates an uncertainty-aware surrogate model of the EuroPED plasma pedestal model using the Bayesian neural network with noise contrastive prior (BNN-NCP) technique. This model is trained using data from the JET-ILW pedestal database and subsequent model evaluations, conforming to EuroPED-NN. The BNN-NCP technique has been proven to be a suitable method for generating uncertainty-aware surrogate models. It matches the output results of a regular neural network while providing confidence estimates for predictions as uncertainties. Additionally, it highlights out-of-distribution (OOD) regions using surrogate model uncertainties. This provides critical insights into model robustness and reliability. EuroPED-NN has been physically validated, first, analyzing electron density $n_e\!\left(\psi_{\text{pol}}=0.94\right)$ with respect to increasing plasma current, $I_p$, and second, validating the $\Delta-\beta_{p,ped}$ relation associated with the EuroPED model. This affirms the robustness of the underlying physics learned by the surrogate model. On top of that, the method was used to develop a EuroPED-like model fed with experimental data, i.e. an uncertainty aware experimental model, which is functional in JET database. Both models have been also tested in $\sim 50$ AUG shots.
♻ ☆ The Initial Screening Order Problem
We investigate the role of the initial screening order (ISO) in candidate screening tasks, such as employee hiring and academic admissions, in which a screener is tasked with selecting $k$ candidates from a candidate pool. The ISO refers to the order in which the screener searches the candidate pool. Today, it is common for the ISO to be the product of an information access system, such as an online platform or a database query. The ISO has been largely overlooked in the literature, despite its potential impact on the optimality and fairness of the chosen $k$ candidates, especially under a human screener. We define two problem formulations describing the search behavior of the screener under the ISO: the best-$k$, where the screener selects the $k$ best candidates; and the good-$k$, where the screener selects the $k$ first good-enough candidates. To study the impact of the ISO, we introduce a human-like screener and compare it to its algorithmic counterpart, where the human-like screener is conceived to be inconsistent over time due to fatigue. In particular, our analysis shows that the ISO, under a human-like screener solving for the good-$k$ problem, hinders individual fairness despite meeting group level fairness, and hampers the optimality of the selected $k$ candidates. This is due to position bias, where a candidate's evaluation is affected by its position within the ISO. We report extensive simulated experiments exploring the parameters of the best-$k$ and good-$k$ problems for the algorithmic and human-like screeners. The simulation framework is flexible enough to account for multiple screening settings, being an alternative to running real-world candidate screening procedures. This work is motivated by a real-world candidate screening problem studied in collaboration with an European company.
♻ ☆ Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis BMVC2024
Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper we address, the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity compared to the state-of-the-art.
comment: Accepted at BMVC2024
♻ ☆ Autonomous Payload Thermal Control SP
In small satellites there is less room for heat control equipment, scientific instruments, and electronic components. Furthermore, the near proximity of electronic components makes power dissipation difficult, with the risk of not being able to control the temperature appropriately, reducing component lifetime and mission performance. To address this challenge, taking advantage of the advent of increasing intelligence on board satellites, an autonomous thermal control tool that uses deep reinforcement learning is proposed for learning the thermal control policy onboard. The tool was evaluated in a real space edge processing computer that will be used in a demonstration payload hosted in the International Space Station (ISS). The experiment results show that the proposed framework is able to learn to control the payload processing power to maintain the temperature under operational ranges, complementing traditional thermal control systems.
comment: To be included in the proceedings of ESA's SPAICE conference at ECSAT, UK, 2024
♻ ☆ Fast Robust Kernel Regression through Sign Gradient Descent with Early Stopping
Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the model parameters. Here, we introduce an equivalent formulation of the objective function of KRR, which opens up both for replacing the ridge penalty with the $\ell_\infty$ and $\ell_1$ penalties and for studying kernel ridge regression from the perspective of gradient descent. Using the $\ell_\infty$ and $\ell_1$ penalties, we obtain robust and sparse kernel regression, respectively. We further study the similarities between explicitly regularized kernel regression and the solutions obtained by early stopping of iterative gradient-based methods, where we connect $\ell_\infty$ regularization to sign gradient descent, $\ell_1$ regularization to forward stagewise regression (also known as coordinate descent), and $\ell_2$ regularization to gradient descent, and, in the last case, theoretically bound for the differences. We exploit the close relations between $\ell_\infty$ regularization and sign gradient descent, and between $\ell_1$ regularization and coordinate descent to propose computationally efficient methods for robust and sparse kernel regression. We finally compare robust kernel regression through sign gradient descent to existing methods for robust kernel regression on five real data sets, demonstrating that our method is one to two orders of magnitude faster, without compromising accuracy.
comment: Article arXiv:2306.16838v1 has been updated and split into two articles: this article and arXiv:2311.01762. Thus, some of the content in arXiv:2306.16838v1 is not a part of arXiv:2306.16838v2, but of arXiv:2311.01762
♻ ☆ Simplifying the Theory on Over-Smoothing
Graph convolutions have gained popularity due to their ability to efficiently operate on data with an irregular geometric structure. However, graph convolutions cause over-smoothing, which refers to representations becoming more similar with increased depth. However, many different definitions and intuitions currently coexist, leading to research efforts focusing on incompatible directions. This paper attempts to align these directions by showing that over-smoothing is merely a special case of power iteration. This greatly simplifies the existing theory on over-smoothing, making it more accessible. Based on the theory, we provide a novel comprehensive definition of rank collapse as a generalized form of over-smoothing and introduce the rank-one distance as a corresponding metric. Our empirical evaluation of 14 commonly used methods shows that more models than were previously known suffer from this issue.
♻ ☆ BadMerging: Backdoor Attacks Against Model Merging CCS
Fine-tuning pre-trained models for downstream tasks has led to a proliferation of open-sourced task-specific models. Recently, Model Merging (MM) has emerged as an effective approach to facilitate knowledge transfer among these independently fine-tuned models. MM directly combines multiple fine-tuned task-specific models into a merged model without additional training, and the resulting model shows enhanced capabilities in multiple tasks. Although MM provides great utility, it may come with security risks because an adversary can exploit MM to affect multiple downstream tasks. However, the security risks of MM have barely been studied. In this paper, we first find that MM, as a new learning paradigm, introduces unique challenges for existing backdoor attacks due to the merging process. To address these challenges, we introduce BadMerging, the first backdoor attack specifically designed for MM. Notably, BadMerging allows an adversary to compromise the entire merged model by contributing as few as one backdoored task-specific model. BadMerging comprises a two-stage attack mechanism and a novel feature-interpolation-based loss to enhance the robustness of embedded backdoors against the changes of different merging parameters. Considering that a merged model may incorporate tasks from different domains, BadMerging can jointly compromise the tasks provided by the adversary (on-task attack) and other contributors (off-task attack) and solve the corresponding unique challenges with novel attack designs. Extensive experiments show that BadMerging achieves remarkable attacks against various MM algorithms. Our ablation study demonstrates that the proposed attack designs can progressively contribute to the attack performance. Finally, we show that prior defense mechanisms fail to defend against our attacks, highlighting the need for more advanced defense.
comment: To appear in ACM Conference on Computer and Communications Security (CCS), 2024
♻ ☆ OriGen:Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection
Recent studies have demonstrated the significant potential of Large Language Models (LLMs) in generating Register Transfer Level (RTL) code, with notable advancements showcased by commercial models such as GPT-4 and Claude3-Opus. However, these proprietary LLMs often raise concerns regarding privacy and security. While open-source LLMs offer solutions to these concerns, they typically underperform commercial models in RTL code generation tasks, primarily due to the scarcity of high-quality open-source RTL datasets. To address this challenge, we introduce OriGen , a fully open-source framework that incorporates self-reflection capabilities and a novel dataset augmentation methodology for generating high-quality, large-scale RTL code. Our approach employs a code-tocode augmentation technique to enhance the quality of open-source RTL code datasets. Furthermore, OriGen can rectify syntactic errors through a self-reflection process that leverages compiler feedback. Experimental results demonstrate that OriGen significantly outperforms other open-source alternatives in RTL code generation. It surpasses the previous best-performing open-source LLM by 12.8% and even exceeds GPT-4 Turbo in the pass@1 metric on the VerilogEval-Human benchmark. Moreover, OriGen exhibits superior capabilities in self-reflection and error correction, outperforming GPT-4 by 19.9% on a benchmark designed to evaluate self-reflection capabilities.
♻ ☆ Biometrics and Behavior Analysis for Detecting Distractions in e-Learning
In this article, we explore computer vision approaches to detect abnormal head pose during e-learning sessions and we introduce a study on the effects of mobile phone usage during these sessions. We utilize behavioral data collected from 120 learners monitored while participating in a MOOC learning sessions. Our study focuses on the influence of phone-usage events on behavior and physiological responses, specifically attention, heart rate, and meditation, before, during, and after phone usage. Additionally, we propose an approach for estimating head pose events using images taken by the webcam during the MOOC learning sessions to detect phone-usage events. Our hypothesis suggests that head posture undergoes significant changes when learners interact with a mobile phone, contrasting with the typical behavior seen when learners face a computer during e-learning sessions. We propose an approach designed to detect deviations in head posture from the average observed during a learner's session, operating as a semi-supervised method. This system flags events indicating alterations in head posture for subsequent human review and selection of mobile phone usage occurrences with a sensitivity over 90%.
comment: Published in IEEE Intl. Symposium on Computers in Education (SIIE) 2024
♻ ☆ VAAD: Visual Attention Analysis Dashboard applied to e-Learning
In this paper, we present an approach in the Multimodal Learning Analytics field. Within this approach, we have developed a tool to visualize and analyze eye movement data collected during learning sessions in online courses. The tool is named VAAD, an acronym for Visual Attention Analysis Dashboard. These eye movement data have been gathered using an eye-tracker and subsequently processed and visualized for interpretation. The purpose of the tool is to conduct a descriptive analysis of the data by facilitating its visualization, enabling the identification of differences and learning patterns among various learner populations. Additionally, it integrates a predictive module capable of anticipating learner activities during a learning session. Consequently, VAAD holds the potential to offer valuable insights into online learning behaviors from both descriptive and predictive perspectives.
comment: Published in IEEE Intl. Symposium on Computers in Education (SIIE) 2024
♻ ☆ From Static to Dynamic Structures: Improving Binding Affinity Prediction with Graph-Based Deep Learning
Accurate prediction of protein-ligand binding affinities is an essential challenge in structure-based drug design. Despite recent advances in data-driven methods for affinity prediction, their accuracy is still limited, partially because they only take advantage of static crystal structures while the actual binding affinities are generally determined by the thermodynamic ensembles between proteins and ligands. One effective way to approximate such a thermodynamic ensemble is to use molecular dynamics (MD) simulation. Here, an MD dataset containing 3,218 different protein-ligand complexes is curated, and Dynaformer, a graph-based deep learning model is further developed to predict the binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories. In silico experiments demonstrated that the model exhibits state-of-the-art scoring and ranking power on the CASF-2016 benchmark dataset, outperforming the methods hitherto reported. Moreover, in a virtual screening on heat shock protein 90 (HSP90) using Dynaformer, 20 candidates are identified and their binding affinities are further experimentally validated. Dynaformer displayed promising results in virtual drug screening, revealing 12 hit compounds (two are in the submicromolar range), including several novel scaffolds. Overall, these results demonstrated that the approach offer a promising avenue for accelerating the early drug discovery process.
comment: Update the content according to the published version on Advanced Science (https://doi.org/10.1002/advs.202405404)
♻ ☆ Directly Handling Missing Data in Linear Discriminant Analysis for Enhancing Classification Accuracy and Interpretability
As the adoption of Artificial Intelligence (AI) models expands into critical real-world applications, ensuring the explainability of these models becomes paramount, particularly in sensitive fields such as medicine and finance. Linear Discriminant Analysis (LDA) remains a popular choice for classification due to its interpretable nature, derived from its capacity to model class distributions and enhance class separation through linear combinations of features. However, real-world datasets often suffer from incomplete data, posing substantial challenges for both classification accuracy and model interpretability. In this paper, we introduce a novel and robust classification method, termed Weighted missing Linear Discriminant Analysis (WLDA), which extends LDA to handle datasets with missing values without the need for imputation. Our approach innovatively incorporates a weight matrix that penalizes missing entries, thereby refining parameter estimation directly on incomplete data. This methodology not only preserves the interpretability of LDA but also significantly enhances classification performance in scenarios plagued by missing data. We conduct an in-depth theoretical analysis to establish the properties of WLDA and thoroughly evaluate its explainability. Experimental results across various datasets demonstrate that WLDA consistently outperforms traditional methods, especially in challenging environments where missing values are prevalent in both training and test datasets. This advancement provides a critical tool for improving classification accuracy and maintaining model transparency in the face of incomplete data.
♻ ☆ ERATTA: Extreme RAG for Table To Answers with Large Language Models
Large language models (LLMs) with retrieval augmented-generation (RAG) have been the optimal choice for scalable generative AI solutions in the recent past. Although RAG implemented with AI agents (agentic-RAG) has been recently popularized, its suffers from unstable cost and unreliable performances for Enterprise-level data-practices. Most existing use-cases that incorporate RAG with LLMs have been either generic or extremely domain specific, thereby questioning the scalability and generalizability of RAG-LLM approaches. In this work, we propose a unique LLM-based system where multiple LLMs can be invoked to enable data authentication, user-query routing, data-retrieval and custom prompting for question-answering capabilities from Enterprise-data tables. The source tables here are highly fluctuating and large in size and the proposed framework enables structured responses in under 10 seconds per query. Additionally, we propose a five metric scoring module that detects and reports hallucinations in the LLM responses. Our proposed system and scoring metrics achieve >90% confidence scores across hundreds of user queries in the sustainability, financial health and social media domains. Extensions to the proposed extreme RAG architectures can enable heterogeneous source querying using LLMs.
comment: 5 pages, 4 tables, IEEE Big Data, 2024
♻ ☆ Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities
Imaging techniques such as Chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection for a wide variety of medical pulmonary and ophthalmic conditions respectively. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families each with pretraining and random initialization. Our finding showed that the use of pretrained models as fixed feature extractors yields poor performance irrespective of the datasets. Contrary, histopathology microscopy whole slide images have better performance. It is also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that the improvements in ImageNet are not parallel to the medical imaging tasks. Within a medical domain, the performance of the network architectures varies within model families with shifts in datasets. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.
comment: 15 pages, 3 figures, 4 tables
♻ ☆ PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator
We present Piecewise Rectified Flow (PeRFlow), a flow-based method for accelerating diffusion models. PeRFlow divides the sampling process of generative flows into several time windows and straightens the trajectories in each interval via the reflow operation, thereby approaching piecewise linear flows. PeRFlow achieves superior performance in a few-step generation. Moreover, through dedicated parameterizations, the PeRFlow models inherit knowledge from the pretrained diffusion models. Thus, the training converges fast and the obtained models show advantageous transfer ability, serving as universal plug-and-play accelerators that are compatible with various workflows based on the pre-trained diffusion models. Codes for training and inference are publicly released. https://github.com/magic-research/piecewise-rectified-flow
♻ ☆ Instant Adversarial Purification with Adversarial Consistency Distillation
Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve defense success rate of 74.19\% on ImageNet, only requiring 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that the GAND does not need a Full Fine Tune (FFT); PEFT, e.g., LoRA is sufficient.
♻ ☆ MLR-Copilot: Autonomous Machine Learning Research based on Large Language Models Agents
Machine learning research, crucial for technological advancements and innovation, often faces significant challenges due to its inherent complexity, slow pace of experimentation, and the necessity for specialized expertise. Motivated by this, we present a new systematic framework, autonomous Machine Learning Research with large language models (MLR-Copilot), designed to enhance machine learning research productivity through the automatic generation and implementation of research ideas using Large Language Model (LLM) agents. The framework consists of three phases: research idea generation, experiment implementation, and implementation execution. First, existing research papers are used to generate hypotheses and experimental plans vis IdeaAgent powered by LLMs. Next, the implementation generation phase translates these plans into executables with ExperimentAgent. This phase leverages retrieved prototype code and optionally retrieves candidate models and data. Finally, the execution phase, also managed by ExperimentAgent, involves running experiments with mechanisms for human feedback and iterative debugging to enhance the likelihood of achieving executable research outcomes. We evaluate our framework on five machine learning research tasks and the experimental results show the framework's potential to facilitate the research progress and innovations.
♻ ☆ A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse
Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raises concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has extended these techniques as a benign metric to prevent the underlying misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA) based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on the white-box information of target models to get rid of the implicit reliance on model-specific knowledge. By accessing merely a small amount of LDM parameters, in specific merely the VAE encoder of LDMs, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM. Our method outperforms existing techniques, offering a more robust and generalizable solution that is helpful in alleviating the socio-technical challenges posed by the rapidly evolving landscape of generative AI.
comment: 21 pages, 7 figures, 10 tables
♻ ☆ Pearl: A Production-ready Reinforcement Learning Agent
Reinforcement learning (RL) is a versatile framework for optimizing long-term goals. Although many real-world problems can be formalized with RL, learning and deploying a performant RL policy requires a system designed to address several important challenges, including the exploration-exploitation dilemma, partial observability, dynamic action spaces, and safety concerns. While the importance of these challenges has been well recognized, existing open-source RL libraries do not explicitly address them. This paper introduces Pearl, a Production-Ready RL software package designed to embrace these challenges in a modular way. In addition to presenting benchmarking results, we also highlight examples of Pearl's ongoing industry adoption to demonstrate its advantages for production use cases. Pearl is open sourced on GitHub at github.com/facebookresearch/pearl and its official website is pearlagent.github.io.
♻ ☆ Fault Tolerant ML: Efficient Meta-Aggregation and Synchronous Training
In this paper, we investigate the challenging framework of Byzantine-robust training in distributed machine learning (ML) systems, focusing on enhancing both efficiency and practicality. As distributed ML systems become integral for complex ML tasks, ensuring resilience against Byzantine failures-where workers may contribute incorrect updates due to malice or error-gains paramount importance. Our first contribution is the introduction of the Centered Trimmed Meta Aggregator (CTMA), an efficient meta-aggregator that upgrades baseline aggregators to optimal performance levels, while requiring low computational demands. Additionally, we propose harnessing a recently developed gradient estimation technique based on a double-momentum strategy within the Byzantine context. Our paper highlights its theoretical and practical advantages for Byzantine-robust training, especially in simplifying the tuning process and reducing the reliance on numerous hyperparameters. The effectiveness of this technique is supported by theoretical insights within the stochastic convex optimization (SCO) framework and corroborated by empirical evidence.
♻ ☆ An Idiosyncrasy of Time-discretization in Reinforcement Learning
Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic.
comment: RLC 2024
♻ ☆ On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift
Public pretraining is a promising approach to improve differentially private model training. However, recent work has noted that many positive research results studying this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data -- a scenario that is likely when finetuning private tasks due to the sensitive nature of the data. In this work, we show empirically across three tasks that even in settings with large distribution shift, where both zero-shot performance from public data and training from scratch with private data give unusably weak results, public features can in fact improve private training accuracy by up to 67\% over private training from scratch. We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training even if it is impossible to learn the private task from the public data alone. Altogether, our results provide evidence that public data can indeed make private training practical in realistic settings of extreme distribution shift.
♻ ☆ From Wide to Deep: Dimension Lifting Network for Parameter-efficient Knowledge Graph Embedding
Knowledge graph embedding (KGE) that maps entities and relations into vector representations is essential for downstream applications. Conventional KGE methods require high-dimensional representations to learn the complex structure of knowledge graph, but lead to oversized model parameters. Recent advances reduce parameters by low-dimensional entity representations, while developing techniques (e.g., knowledge distillation or reinvented representation forms) to compensate for reduced dimension. However, such operations introduce complicated computations and model designs that may not benefit large knowledge graphs. To seek a simple strategy to improve the parameter efficiency of conventional KGE models, we take inspiration from that deeper neural networks require exponentially fewer parameters to achieve expressiveness comparable to wider networks for compositional structures. We view all entity representations as a single-layer embedding network, and conventional KGE methods that adopt high-dimensional entity representations equal widening the embedding network to gain expressiveness. To achieve parameter efficiency, we instead propose a deeper embedding network for entity representations, i.e., a narrow entity embedding layer plus a multi-layer dimension lifting network (LiftNet). Experiments on three public datasets show that by integrating LiftNet, four conventional KGE methods with 16-dimensional representations achieve comparable link prediction accuracy as original models that adopt 512-dimensional representations, saving 68.4% to 96.9% parameters.
♻ ☆ TrajDeleter: Enabling Trajectory Forgetting in Offline Reinforcement Learning Agents NDSS 2025
Reinforcement learning (RL) trains an agent from experiences interacting with the environment. In scenarios where online interactions are impractical, offline RL, which trains the agent using pre-collected datasets, has become popular. While this new paradigm presents remarkable effectiveness across various real-world domains, like healthcare and energy management, there is a growing demand to enable agents to rapidly and completely eliminate the influence of specific trajectories from both the training dataset and the trained agents. To meet this problem, this paper advocates Trajdeleter, the first practical approach to trajectory unlearning for offline RL agents. The key idea of Trajdeleter is to guide the agent to demonstrate deteriorating performance when it encounters states associated with unlearning trajectories. Simultaneously, it ensures the agent maintains its original performance level when facing other remaining trajectories. Additionally, we introduce Trajauditor, a simple yet efficient method to evaluate whether Trajdeleter successfully eliminates the specific trajectories of influence from the offline RL agent. Extensive experiments conducted on six offline RL algorithms and three tasks demonstrate that Trajdeleter requires only about 1.5% of the time needed for retraining from scratch. It effectively unlearns an average of 94.8% of the targeted trajectories yet still performs well in actual environment interactions after unlearning. The replication package and agent parameters are available online.
comment: Accepted at NDSS 2025. The presented document here is the full version of our paper
♻ ☆ Blending Neural Operators and Relaxation Methods in PDE Numerical Solvers
Neural networks suffer from spectral bias having difficulty in representing the high frequency components of a function while relaxation methods can resolve high frequencies efficiently but stall at moderate to low frequencies. We exploit the weaknesses of the two approaches by combining them synergistically to develop a fast numerical solver of partial differential equations (PDEs) at scale. Specifically, we propose HINTS, a hybrid, iterative, numerical, and transferable solver by integrating a Deep Operator Network (DeepONet) with standard relaxation methods, leading to parallel efficiency and algorithmic scalability for a wide class of PDEs, not tractable with existing monolithic solvers. HINTS balances the convergence behavior across the spectrum of eigenmodes by utilizing the spectral bias of DeepONet, resulting in a uniform convergence rate and hence exceptional performance of the hybrid solver overall. Moreover, HINTS applies to large-scale, multidimensional systems, it is flexible with regards to discretizations, computational domain, and boundary conditions.
comment: Main text: 17 pages, 6 figures. Supplementary Information: 30 pages, 8 figures, 2 tables, 4 algorithms
Multimedia
☆ Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement
Recently, attention-based transformers have become a de facto standard in many deep learning applications including natural language processing, computer vision, signal processing, etc.. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility and jointly train both speaker encoder and speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that the use of a dual path transformer in the separator backbone along with proposed training paradigm improves the CNN baseline by $3.12$ dB points. Finally, we compare our approach with recent state-of-the-arts and show that our model outperforms existing methods by $4.1$ dB points on an average without creating additional data dependency.
☆ Multi-Reference Generative Face Video Compression with Contrastive Learning
Generative face video coding (GFVC) has been demonstrated as a potential approach to low-latency, low bitrate video conferencing. GFVC frameworks achieve an extreme gain in coding efficiency with over 70% bitrate savings when compared to conventional codecs at bitrates below 10kbps. In recent MPEG/JVET standardization efforts, all the information required to reconstruct video sequences using GFVC frameworks are adopted as part of the supplemental enhancement information (SEI) in existing compression pipelines. In light of this development, we aim to address a challenge that has been weakly addressed in prior GFVC frameworks, i.e., reconstruction drift as the distance between the reference and target frames increases. This challenge creates the need to update the reference buffer more frequently by transmitting more Intra-refresh frames, which are the most expensive element of the GFVC bitstream. To overcome this problem, we propose instead multiple reference animation as a robust approach to minimizing reconstruction drift, especially when used in a bi-directional prediction mode. Further, we propose a contrastive learning formulation for multi-reference animation. We observe that using a contrastive learning framework enhances the representation capabilities of the animation generator. The resulting framework, MRDAC (Multi-Reference Deep Animation Codec) can therefore be used to compress longer sequences with fewer reference frames or achieve a significant gain in reconstruction accuracy at comparable bitrates to previous frameworks. Quantitative and qualitative results show significant coding and reconstruction quality gains compared to previous GFVC methods, and more accurate animation quality in presence of large pose and facial expression changes.
☆ Interpretable Convolutional SyncNet
Because videos in the wild can be out of sync for various reasons, a sync-net is used to bring the video back into sync for tasks that require synchronized videos. Previous state-of-the-art (SOTA) sync-nets use InfoNCE loss, rely on the transformer architecture, or both. Unfortunately, the former makes the model's output difficult to interpret, and the latter is unfriendly with large images, thus limiting the usefulness of sync-nets. In this work, we train a convolutional sync-net using the balanced BCE loss (BBCE), a loss inspired by the binary cross entropy (BCE) and the InfoNCE losses. In contrast to the InfoNCE loss, the BBCE loss does not require complicated sampling schemes. Our model can better handle larger images, and its output can be given a probabilistic interpretation. The probabilistic interpretation allows us to define metrics such as probability at offset and offscreen ratio to evaluate the sync quality of audio-visual (AV) speech datasets. Furthermore, our model achieves SOTA accuracy of $96.5\%$ on the LRS2 dataset and $93.8\%$ on the LRS3 dataset.
comment: 8+5 pages
♻ ☆ Inter-Frame Compression for Dynamic Point Cloud Geometry Coding
Efficient point cloud compression is essential for applications like virtual and mixed reality, autonomous driving, and cultural heritage. This paper proposes a deep learning-based inter-frame encoding scheme for dynamic point cloud geometry compression. We propose a lossy geometry compression scheme that predicts the latent representation of the current frame using the previous frame by employing a novel feature space inter-prediction network. The proposed network utilizes sparse convolutions with hierarchical multiscale 3D feature learning to encode the current frame using the previous frame. The proposed method introduces a novel predictor network for motion compensation in the feature domain to map the latent representation of the previous frame to the coordinates of the current frame to predict the current frame's feature embedding. The framework transmits the residual of the predicted features and the actual features by compressing them using a learned probabilistic factorized entropy model. At the receiver, the decoder hierarchically reconstructs the current frame by progressively rescaling the feature embedding. The proposed framework is compared to the state-of-the-art Video-based Point Cloud Compression (V-PCC) and Geometry-based Point Cloud Compression (G-PCC) schemes standardized by the Moving Picture Experts Group (MPEG). The proposed method achieves more than 88% BD-Rate (Bjontegaard Delta Rate) reduction against G-PCCv20 Octree, more than 56% BD-Rate savings against G-PCCv20 Trisoup, more than 62% BD-Rate reduction against V-PCC intra-frame encoding mode, and more than 52% BD-Rate savings against V-PCC P-frame-based inter-frame encoding mode using HEVC. These significant performance gains are cross-checked and verified in the MPEG working group.
♻ ☆ MCDubber: Multimodal Context-Aware Expressive Video Dubbing SC2024
Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed \textbf{MCDubber}, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at https://github.com/XiaoYuanJun-zy/MCDubber.
comment: Accepted by NCMMSC2024
♻ ☆ Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation ICPR 2024
In this work, we study the task of ``visually'' translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image that preserves visual features of the source scene text, such as font, size, and background. There are several challenges associated with this task, such as translation with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements to design a variant of the baseline to obtain performance improvements. (iv) Currently, the existing related literature lacks any comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate presented baselines for translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baseline only partially addresses challenges posed by visual translation tasks. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.
comment: Accepted at ICPR 2024, Project Website: https://vl2g.github.io/projects/visTrans/
Computation and Language
♻ ☆ CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
Open large language models (LLMs) have significantly advanced the field of natural language processing, showcasing impressive performance across various tasks.Despite the significant advancements in LLMs, their effective operation still relies heavily on human input to accurately guide the dialogue flow, with agent tuning being a crucial optimization technique that involves human adjustments to the model for better response to such guidance.Addressing this dependency, our work introduces the TinyAgent model, trained on a meticulously curated high-quality dataset. We also present the Collaborative Multi-Agent Tuning (CMAT) framework, an innovative system designed to augment language agent capabilities through adaptive weight updates based on environmental feedback. This framework fosters collaborative learning and real-time adaptation among multiple intelligent agents, enhancing their context-awareness and long-term memory. In this research, we propose a new communication agent framework that integrates multi-agent systems with environmental feedback mechanisms, offering a scalable method to explore cooperative behaviors. Notably, our TinyAgent-7B model exhibits performance on par with GPT-3.5, despite having fewer parameters, signifying a substantial improvement in the efficiency and effectiveness of LLMs.
♻ ☆ Dynamic Boundary Time Warping for Sub-sequence Matching with Few Examples
The paper presents a novel method of finding a fragment in a long temporal sequence similar to the set of shorter sequences. We are the first to propose an algorithm for such a search that does not rely on computing the average sequence from query examples. Instead, we use query examples as is, utilizing all of them simultaneously. The introduced method based on the Dynamic Time Warping (DTW) technique is suited explicitly for few-shot query-by-example retrieval tasks. We evaluate it on two different few-shot problems from the field of Natural Language Processing. The results show it either outperforms baselines and previous approaches or achieves comparable results when a low number of examples is available.
♻ ☆ REBEL: Reinforcement Learning via Regressing Relative Rewards
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard.
comment: New experimental results on general chat
♻ ☆ MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
Large language models (LLM) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that the performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions consistent with quantifying LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help the LLM generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful "attacks" modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless "trick" the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how to use performance on a "MedFuzzed" benchmark, as well as individual successful attacks. The methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.
comment: 9 pages, 3 figures, 2 algorithms, appendix
♻ ☆ Forecasting Live Chat Intent from Browsing History CIKM 2024
Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user's browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
comment: CIKM 2024
♻ ☆ Greedy Grammar Induction with Indirect Negative Evidence
This paper offers a fresh look at the pumping lemma constant as an upper bound on the information required for learning Context Free Grammars. An objective function based on indirect negative evidence considers the occurrences, and non-occurrences, of a finite number of strings, encountered after a sufficiently long presentation. This function has optimal substructure in the hypotheses space, giving rise to a greedy search learner in a branch and bound method. A hierarchy of learnable classes is defined in terms of the number of production rules that must be added to interim solutions in order to incrementally fit the input. Efficiency strongly depends on the position of the target grammar in the hierarchy and on the richness of the input.
comment: 12 pages (including references), 1 png files. 5 anciliary files (dataset)
♻ ☆ LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
In traditional RAG framework, the basic retrieval units are normally short. The common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, the readers only need to generate answers from the short retrieved units. The imbalanced `heavy' retriever and `light' reader design can lead to sub-optimal performance. The loss of contextual information in the short, chunked units may increase the likelihood of introducing hard negatives during the retrieval stage. Additionally, the reader might not fully leverage the capabilities of recent advancements in LLMs. In order to alleviate the imbalance, we propose a new framework LongRAG, consisting of a `long retriever' and a `long reader'. In the two Wikipedia-based datasets, NQ and HotpotQA, LongRAG processes the entire Wikipedia corpus into 4K-token units by grouping related documents. By increasing the unit size, we significantly reduce the total number of units. This greatly reduces the burden on the retriever, resulting in strong retrieval performance with only a few (less than 8) top units. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ and 64.3% on HotpotQA, which are on par with the (fully-trained) SoTA model. Furthermore, we test on two non-Wikipedia-based datasets, Qasper and MultiFieldQA-en. LongRAG processes each individual document as a single (long) unit rather than chunking them into smaller units. By doing so, we achieve an F1 score of 25.9% on Qasper and 57.5% on MultiFieldQA-en. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
comment: Technical Report
♻ ☆ Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have witnessed remarkable progress for medical decision-making. However, they are designated for specific classification or generative tasks, and require model training or finetuning on large-scale datasets with sizeable parameters and tremendous computing, hindering their clinical utility across diverse resource-constrained scenarios in practice. In this paper, we propose a novel and lightweight framework Med-MoE (Mixture-of-Experts) that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we then enable the model for different multimodal medical tasks with instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by meta expert. Comprehensive experiments on both open- and close-end medical question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model can achieve performance superior to or on par with state-of-the-art baselines, while only requiring approximately 30\%-50\% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
♻ ☆ T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its security risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover fewer aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, a new benchmark designed for conducting safety-critical assessments of text-to-video models. We define 12 critical aspects of video generation safety and construct a malicious prompt dataset including real-world prompts, LLM-generated prompts and jailbreak attack-based prompts. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AI.
♻ ☆ CancerLLM: A Large Language Model in Cancer Domain
Medical Large Language Models (LLMs) such as ClinicalCamel 70B, Llama3-OpenBioLLM 70B have demonstrated impressive performance on a wide variety of medical NLP task.However, there still lacks a large language model (LLM) specifically designed for cancer domain. Moreover, these LLMs typically have billions of parameters, making them computationally expensive for healthcare systems.Thus, in this study, we propose CancerLLM, a model with 7 billion parameters and a Mistral-style architecture, pre-trained on 2,676,642 clinical notes and 515,524 pathology reports covering 17 cancer types, followed by fine-tuning on three cancer-relevant tasks, including cancer phenotypes extraction, and cancer diagnosis generation. Our evaluation demonstrated that CancerLLM achieves state-of-the-art results compared to other existing LLMs, with an average F1 score improvement of 7.61 %. Additionally, CancerLLM outperforms other models on two proposed robustness testbeds. This illustrates that CancerLLM can be effectively applied to clinical AI systems, enhancing clinical research and healthcare delivery in the field of cancer.
comment: add the diagnosis evaluation of ICD code
♻ ☆ MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known challenge for ML systems to generalize successfully. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provides a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.
comment: Accepted for publication in the Journal of Machine Learning Research. Dataset available at https://doi.org/10.5061/dryad.dncjsxm4h , code available at https://github.com/heinz-jeffrey/subregular-learning
♻ ☆ The Oscars of AI Theater: A Survey on Role-Playing with Language Models
This survey explores the burgeoning field of role-playing with language models, focusing on their development from early persona-based models to advanced character-driven simulations facilitated by Large Language Models (LLMs). Initially confined to simple persona consistency due to limited model capabilities, role-playing tasks have now expanded to embrace complex character portrayals involving character consistency, behavioral alignment, and overall attractiveness. We provide a comprehensive taxonomy of the critical components in designing these systems, including data, models and alignment, agent architecture and evaluation. This survey not only outlines the current methodologies and challenges, such as managing dynamic personal profiles and achieving high-level persona consistency but also suggests avenues for future research in improving the depth and realism of role-playing applications. The goal is to guide future research by offering a structured overview of current methodologies and identifying potential areas for improvement. Related resources and papers are available at https://github.com/nuochenpku/Awesome-Role-Play-Papers.
comment: 28 pages
♻ ☆ Ex3: Automatic Novel Writing by Extracting, Excelsior and Expanding
Generating long-term texts such as novels using artificial intelligence has always been a challenge. A common approach is to use large language models (LLMs) to construct a hierarchical framework that first plans and then writes. Despite the fact that the generated novels reach a sufficient length, they exhibit poor logical coherence and appeal in their plots and deficiencies in character and event depiction, ultimately compromising the overall narrative quality. In this paper, we propose a method named Extracting Excelsior and Expanding. Ex3 initially extracts structure information from raw novel data. By combining this structure information with the novel data, an instruction-following dataset is meticulously crafted. This dataset is then utilized to fine-tune the LLM, aiming for excelsior generation performance. In the final stage, a tree-like expansion method is deployed to facilitate the generation of arbitrarily long novels. Evaluation against previous methods showcases Ex3's ability to produce higher-quality long-form novels.
♻ ☆ Tur[k]ingBench: A Challenge Benchmark for Web Agents
Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.
♻ ☆ ReMamba: Equip Mamba with Effective Long-Sequence Modeling
While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited compared to transformer-based models. In this study, we investigate the long-context efficiency issues of the Mamba models and propose ReMamba, which enhances Mamba's ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process, incurring minimal additional inference costs overhead. Experimental results on the LongBench and L-Eval benchmarks demonstrate ReMamba's efficacy, improving over the baselines by 3.2 and 1.6 points, respectively, and attaining performance almost on par with same-size transformer models.
♻ ☆ Editing Personality for Large Language Models NLPCC 2024
This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs). This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. Specifically, we construct PersonalityEdit, a new benchmark dataset to address this task. Drawing on the theory in Social Psychology, we isolate three representative traits, namely Neuroticism, Extraversion, and Agreeableness, as the foundation for our benchmark. We then gather data using GPT-4, generating responses that align with a specified topic and embody the targeted personality trait. We conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in LLMs. Our findings uncover potential challenges of the proposed task, illustrating several remaining issues. We anticipate that our work can stimulate further annotation in model editing and personality-related research. Code is available at https://github.com/zjunlp/EasyEdit.
comment: NLPCC 2024
♻ ☆ Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships
Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. This strategy differs from prior studies by aiming to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose the use of pseudo-labels for these generated texts, leveraging membership approximations indicated by machine-generated probabilities from the target LM. We subsequently fine-tune the LM to favor generations with higher likelihoods of originating from the pre-training data, based on their membership probabilities. Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
comment: 20 pages, 6 figures, 15 tables
♻ ☆ BEADs: Bias Evaluation Across Domains
Recent advancements in large language models (LLMs) have greatly enhanced natural language processing (NLP) applications. Nevertheless, these models often inherit biases from their training data. Despite the availability of various datasets, most are limited to one or two NLP tasks (typically classification or evaluation) and lack comprehensive evaluations across a broader range of NLP tasks. To address this gap, we introduce the Bias Evaluations Across Domains (BEADs) dataset, designed to support a wide array of NLP tasks, including text classification, token classification, bias quantification, and benign language generation. A key focus of this paper is the gold label subset of BEADs, an important portion of the data verified by experts to ensure high reliability. BEADs provides data for both fine-tuning, including classification and language generation tasks, and for evaluating LLMs. Our findings indicate that BEADs effectively identifies numerous biases when fine-tuned on this dataset. It also reduces biases when used for fine-tuning language generation task, while preserving language quality. The results also reveal some prevalent demographic biases in LLMs when BEADs is used for evaluation in demographic task. The benchmarking results highlight the efficacy of fine-tuning LLMs for bias identification and the necessity of comprehensive bias evaluation. We make BEADs publicly available to promote more responsible AI development. The dataset can be accessed at https://huggingface.co/datasets/shainar/BEAD .
comment: under review
♻ ☆ The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
Computer Vision and Pattern Recognition
♻ ☆ U-BEV: Height-aware Bird's-Eye-View Segmentation and Neural Map-based Relocalization
Efficient relocalization is essential for intelligent vehicles when GPS reception is insufficient or sensor-based localization fails. Recent advances in Bird's-Eye-View (BEV) segmentation allow for accurate estimation of local scene appearance and in turn, can benefit the relocalization of the vehicle. However, one downside of BEV methods is the heavy computation required to leverage the geometric constraints. This paper presents U-BEV, a U-Net inspired architecture that extends the current state-of-the-art by allowing the BEV to reason about the scene on multiple height layers before flattening the BEV features. We show that this extension boosts the performance of the U-BEV by up to 4.11 IoU. Additionally, we combine the encoded neural BEV with a differentiable template matcher to perform relocalization on neural SD-map data. The model is fully end-to-end trainable and outperforms transformer-based BEV methods of similar computational complexity by 1.7 to 2.8 mIoU and BEV-based relocalization by over 26% Recall Accuracy on the nuScenes dataset.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ REBEL: Reinforcement Learning via Regressing Relative Rewards
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard.
comment: New experimental results on general chat
♻ ☆ GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection
As DeepFake video manipulation techniques escalate, posing profound threats, the urgent need to develop efficient detection strategies is underscored. However, one particular issue lies with facial images being mis-detected, often originating from degraded videos or adversarial attacks, leading to unexpected temporal artifacts that can undermine the efficacy of DeepFake video detection techniques. This paper introduces a novel method for robust DeepFake video detection, harnessing the power of the proposed Graph-Regularized Attentive Convolutional Entanglement (GRACE) based on the graph convolutional network with graph Laplacian to address the aforementioned challenges. First, conventional Convolution Neural Networks are deployed to perform spatiotemporal features for the entire video. Then, the spatial and temporal features are mutually entangled by constructing a graph with sparse constraint, enforcing essential features of valid face images in the noisy face sequences remaining, thus augmenting stability and performance for DeepFake video detection. Furthermore, the Graph Laplacian prior is proposed in the graph convolutional network to remove the noise pattern in the feature space to further improve the performance. Comprehensive experiments are conducted to illustrate that our proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences. The source code is available at https://github.com/ming053l/GRACE.
comment: Submitted to TPAMI 2024
♻ ☆ Exploring Driving Behavior for Autonomous Vehicles Based on Gramian Angular Field Vision Transformer
Effective classification of autonomous vehicle (AV) driving behavior emerges as a critical area for diagnosing AV operation faults, enhancing autonomous driving algorithms, and reducing accident rates. This paper presents the Gramian Angular Field Vision Transformer (GAF-ViT) model, designed to analyze AV driving behavior. The proposed GAF-ViT model consists of three key components: GAF Transformer Module, Channel Attention Module, and Multi-Channel ViT Module. These modules collectively convert representative sequences of multivariate behavior into multi-channel images and employ image recognition techniques for behavior classification. A channel attention mechanism is applied to multi-channel images to discern the impact of various driving behavior features. Experimental evaluation on the Waymo Open Dataset of trajectories demonstrates that the proposed model achieves state-of-the-art performance. Furthermore, an ablation study effectively substantiates the efficacy of individual modules within the model.
♻ ☆ Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have witnessed remarkable progress for medical decision-making. However, they are designated for specific classification or generative tasks, and require model training or finetuning on large-scale datasets with sizeable parameters and tremendous computing, hindering their clinical utility across diverse resource-constrained scenarios in practice. In this paper, we propose a novel and lightweight framework Med-MoE (Mixture-of-Experts) that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we then enable the model for different multimodal medical tasks with instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by meta expert. Comprehensive experiments on both open- and close-end medical question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model can achieve performance superior to or on par with state-of-the-art baselines, while only requiring approximately 30\%-50\% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
♻ ☆ T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its security risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover fewer aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, a new benchmark designed for conducting safety-critical assessments of text-to-video models. We define 12 critical aspects of video generation safety and construct a malicious prompt dataset including real-world prompts, LLM-generated prompts and jailbreak attack-based prompts. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AI.
♻ ☆ Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery
Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making, enabling doctors to extract understanding from clinical images and videos. In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions. However, the inability of VQA models to visually indicate the regions of interest corresponding to the given questions results in incomplete comprehension of the surgical scene. To tackle this, we propose the surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries regarding surgical images. Furthermore, to address the strong demand for safety in surgical scenarios and potential corruptions in image acquisition and transmission, we propose a novel approach called Calibrated Co-Attention Gated Vision-Language (C$^2$G-ViL) embedding to integrate and align multimodal information effectively. Additionally, we leverage the adversarial sample-based contrastive learning strategy to boost our performance and robustness. We also extend our EndoVis-18-VQLA and EndoVis-17-VQLA datasets to broaden the scope and application of our data. Extensive experiments on the aforementioned datasets demonstrate the remarkable performance and robustness of our solution. Our solution can effectively combat real-world image corruption. Thus, our proposed approach can serve as an effective tool for assisting surgical education, patient care, and enhancing surgical outcomes.
comment: Accepted by Information Fusion. Code and data availability: https://github.com/longbai1006/Surgical-VQLAPlus
♻ ☆ CSAD: Unsupervised Component Segmentation for Logical Anomaly Detection
To improve logical anomaly detection, some previous works have integrated segmentation techniques with conventional anomaly detection methods. Although these methods are effective, they frequently lead to unsatisfactory segmentation results and require manual annotations. To address these drawbacks, we develop an unsupervised component segmentation technique that leverages foundation models to autonomously generate training labels for a lightweight segmentation network without human labeling. Integrating this new segmentation technique with our proposed Patch Histogram module and the Local-Global Student-Teacher (LGST) module, we achieve a detection AUROC of 95.3% in the MVTec LOCO AD dataset, which surpasses previous SOTA methods. Furthermore, our proposed method provides lower latency and higher throughput than most existing approaches.
♻ ☆ Quantum Implicit Neural Representations
Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conducted experiments in signal representation, image superresolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.
comment: This paper was accepted by icml 2024
♻ ☆ Generic Objects as Pose Probes for Few-Shot View Synthesis
Radiance fields including NeRFs and 3D Gaussians demonstrate great potential in high-fidelity rendering and scene reconstruction, while they require a substantial number of posed images as inputs. COLMAP is frequently employed for preprocessing to estimate poses, while it necessitates a large number of feature matches to operate effectively, and it struggles with scenes characterized by sparse features, large baselines between images, or a limited number of input images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images. Traditional methods often use calibration boards but they are not common in images. We propose a novel idea of utilizing everyday objects, commonly found in both images and real life, as "pose probes". The probe object is automatically segmented by SAM, whose shape is initialized from a cube. We apply a dual-branch volume rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. Specifically, object poses of two views are first estimated by PnP matching in an SDF representation, which serves as initial poses. PnP matching, requiring only a few features, is suitable for feature-sparse scenes. Additional views are incrementally incorporated to refine poses from preceding views. In experiments, PoseProbe achieves state-of-the-art performance in both pose estimation and novel view synthesis across multiple datasets. We demonstrate its effectiveness, particularly in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different objects in a scene yields comparable performance. Our project page is available at: \href{https://zhirui-gao.github.io/PoseProbe.github.io/}{this https URL}
♻ ☆ Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination
Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs oblivious to the accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may equally advocate both accurate and erroneous content. To address this issue, we propose Pensieve, a training-free method that leverages the analogous visual hallucinations, which are induced by images sharing common semantic and appearance characteristics, to mitigate hallucination. Specifically, Pensieve enables MLLMs to retrospect relevant images as references and compare their visual content with the test image via confidence score subtraction. Moreover, our paradigm balances the effects of addressing errors from both the visual and textual branches by adaptively scaling the subtracted scores. Experiments on Whoops, LLaVA Bench, POPE, and MME demonstrate the efficacy of Pensieve in mitigating visual hallucination, surpassing other advanced decoding strategies. Pensieve also aids MLLMs in identifying visual details and enhance the specificity of generated image descriptions.
♻ ☆ Tur[k]ingBench: A Challenge Benchmark for Web Agents
Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.
♻ ☆ MBSS-T1: Model-Based Self-Supervised Motion Correction for Robust Cardiac T1 Mapping
T1 mapping is a valuable quantitative MRI technique for diagnosing diffuse myocardial diseases. Traditional methods, relying on breath-hold sequences and echo triggering, face challenges with patient compliance and arrhythmias, limiting their effectiveness. Image registration can enable motion-robust T1 mapping, but inherent intensity differences between time points pose a challenge. We introduce MBSS-T1, a self-supervised model for motion correction in cardiac T1 mapping, constrained by physical and anatomical principles. The physical constraints ensure expected signal decay behavior, while the anatomical constraints maintain realistic deformations. The unique combination of these constraints ensures accurate T1 mapping along the longitudinal relaxation axis. MBSS-T1 outperformed baseline deep-learning-based image registration approaches in a 5-fold experiment on a public dataset of 210 patients (STONE sequence) and an internal dataset of 19 patients (MOLLI sequence). MBSS-T1 excelled in model fitting quality ($R^2$: 0.975 vs. 0.941, 0.946), anatomical alignment (Dice score: 0.89 vs. 0.84, 0.88), and expert visual quality assessment for the presence of visible motion artifacts (4.33 vs. 3.38, 3.66). MBSS-T1 has the potential to enable motion-robust T1 mapping for a broader range of patients, overcoming challenges such as arrhythmias and suboptimal compliance, and allowing for free-breathing T1 mapping without requiring large training datasets. Our code will be publicly available upon acceptance.
♻ ☆ Achieving Resolution-Agnostic DNN-based Image Watermarking: A Novel Perspective of Implicit Neural Representation ACM MM'24
DNN-based watermarking methods are rapidly developing and delivering impressive performances. Recent advances achieve resolution-agnostic image watermarking by reducing the variant resolution watermarking problem to a fixed resolution watermarking problem. However, such a reduction process can potentially introduce artifacts and low robustness. To address this issue, we propose the first, to the best of our knowledge, Resolution-Agnostic Image WaterMarking (RAIMark) framework by watermarking the implicit neural representation (INR) of image. Unlike previous methods, our method does not rely on the previous reduction process by directly watermarking the continuous signal instead of image pixels, thus achieving resolution-agnostic watermarking. Precisely, given an arbitrary-resolution image, we fit an INR for the target image. As a continuous signal, such an INR can be sampled to obtain images with variant resolutions. Then, we quickly fine-tune the fitted INR to get a watermarked INR conditioned on a binary secret message. A pre-trained watermark decoder extracts the hidden message from any sampled images with arbitrary resolutions. By directly watermarking INR, we achieve resolution-agnostic watermarking with increased robustness. Extensive experiments show that our method outperforms previous methods with significant improvements: averagely improved bit accuracy by 7%$\sim$29%. Notably, we observe that previous methods are vulnerable to at least one watermarking attack (e.g. JPEG, crop, resize), while ours are robust against all watermarking attacks.
comment: Accepted by ACM MM'24
♻ ☆ NEDS-SLAM: A Neural Explicit Dense Semantic SLAM Framework using 3D Gaussian Splatting
We propose NEDS-SLAM, a dense semantic SLAM system based on 3D Gaussian representation, that enables robust 3D semantic mapping, accurate camera tracking, and high-quality rendering in real-time. In the system, we propose a Spatially Consistent Feature Fusion model to reduce the effect of erroneous estimates from pre-trained segmentation head on semantic reconstruction, achieving robust 3D semantic Gaussian mapping. Additionally, we employ a lightweight encoder-decoder to compress the high-dimensional semantic features into a compact 3D Gaussian representation, mitigating the burden of excessive memory consumption. Furthermore, we leverage the advantage of 3D Gaussian splatting, which enables efficient and differentiable novel view rendering, and propose a Virtual Camera View Pruning method to eliminate outlier gaussians, thereby effectively enhancing the quality of scene representations. Our NEDS-SLAM method demonstrates competitive performance over existing dense semantic SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets, while also showing excellent capabilities in 3D dense semantic mapping.
comment: accepted by RA-L, IEEE Robotics and Automation Letters
♻ ☆ CRS-Diff: Controllable Remote Sensing Image Generation with Diffusion Model
The emergence of generative models has revolutionized the field of remote sensing (RS) image generation. Despite generating high-quality images, existing methods are limited in relying mainly on text control conditions, and thus do not always generate images accurately and stably. In this paper, we propose CRS-Diff, a new RS generative framework specifically tailored for RS image generation, leveraging the inherent advantages of diffusion models while integrating more advanced control mechanisms. Specifically, CRS-Diff can simultaneously support text-condition, metadata-condition, and image-condition control inputs, thus enabling more precise control to refine the generation process. To effectively integrate multiple condition control information, we introduce a new conditional control mechanism to achieve multi-scale feature fusion, thus enhancing the guiding effect of control conditions. To our knowledge, CRS-Diff is the first multiple-condition controllable RS generative model. Experimental results in single-condition and multiple-condition cases have demonstrated the superior ability of our CRS-Diff to generate RS images both quantitatively and qualitatively compared with previous methods. Additionally, our CRS-Diff can serve as a data engine that generates high-quality training data for downstream tasks, e.g., road extraction. The code is available at https://github.com/Sonettoo/CRS-Diff.
♻ ☆ Training and Tuning Generative Neural Radiance Fields for Attribute-Conditional 3D-Aware Face Generation
Generative Neural Radiance Fields (GNeRF)-based 3D-aware GANs have showcased remarkable prowess in crafting high-fidelity images while upholding robust 3D consistency, particularly face generation. However, specific existing models prioritize view consistency over disentanglement, leading to constrained semantic or attribute control during the generation process. While many methods have explored incorporating semantic masks or leveraging 3D Morphable Models (3DMM) priors to imbue models with semantic control, these methods often demand training from scratch, entailing significant computational overhead. In this paper, we propose a novel approach: a conditional GNeRF model that integrates specific attribute labels as input, thus amplifying the controllability and disentanglement capabilities of 3D-aware generative models. Our approach builds upon a pre-trained 3D-aware face model, and we introduce a Training as Init and Optimizing for Tuning (TRIOT) method to train a conditional normalized flow module to enable the facial attribute editing, then optimize the latent vector to improve attribute-editing precision further. Our extensive experiments substantiate the efficacy of our model, showcasing its ability to generate high-quality edits with enhanced view consistency while safeguarding non-target regions. The code for our model is publicly available at https://github.com/zhangqianhui/TT-GNeRF.
comment: 14 pages
♻ ☆ ICAL: Implicit Character-Aided Learning for Enhanced Handwritten Mathematical Expression Recognition ICDAR 2024
Significant progress has been made in the field of handwritten mathematical expression recognition, while existing encoder-decoder methods are usually difficult to model global information in $LaTeX$. Therefore, this paper introduces a novel approach, Implicit Character-Aided Learning (ICAL), to mine the global expression information and enhance handwritten mathematical expression recognition. Specifically, we propose the Implicit Character Construction Module (ICCM) to predict implicit character sequences and use a Fusion Module to merge the outputs of the ICCM and the decoder, thereby producing corrected predictions. By modeling and utilizing implicit character information, ICAL achieves a more accurate and context-aware interpretation of handwritten mathematical expressions. Experimental results demonstrate that ICAL notably surpasses the state-of-the-art(SOTA) models, improving the expression recognition rate (ExpRate) by 2.25\%/1.81\%/1.39\% on the CROHME 2014/2016/2019 datasets respectively, and achieves a remarkable 69.06\% on the challenging HME100k test set. We make our code available on the GitHub: https://github.com/qingzhenduyu/ICAL
comment: ICDAR 2024 Oral Paper
♻ ☆ DiffMap: Enhancing Map Segmentation with Map Prior Using Diffusion Model
Constructing high-definition (HD) maps is a crucial requirement for enabling autonomous driving. In recent years, several map segmentation algorithms have been developed to address this need, leveraging advancements in Bird's-Eye View (BEV) perception. However, existing models still encounter challenges in producing realistic and consistent semantic map layouts. One prominent issue is the limited utilization of structured priors inherent in map segmentation masks. In light of this, we propose DiffMap, a novel approach specifically designed to model the structured priors of map segmentation masks using latent diffusion model. By incorporating this technique, the performance of existing semantic segmentation methods can be significantly enhanced and certain structural errors present in the segmentation outputs can be effectively rectified. Notably, the proposed module can be seamlessly integrated into any map segmentation model, thereby augmenting its capability to accurately delineate semantic information. Furthermore, through extensive visualization analysis, our model demonstrates superior proficiency in generating results that more accurately reflect real-world map layouts, further validating its efficacy in improving the quality of the generated maps.
♻ ☆ Advancing Medical Image Segmentation: Morphology-Driven Learning with Diffusion Transformer BMVC 2024
Understanding the morphological structure of medical images and precisely segmenting the region of interest or abnormality is an important task that can assist in diagnosis. However, the unique properties of medical imaging make clear segmentation difficult,and the high cost and time-consuming task of labeling leads to a coarse-grained representation of ground truth. Facing with these problems, we propose a novel Diffusion Transformer Segmentation (DTS) model for robust segmentation in the presence of noise. We propose an alternative to the dominant Denoising U-Net encoder through experiments applying a transformer architecture, which captures global dependency through self-attention. Additionally, we propose k-neighbor label smoothing, reverse boundary attention, and self-supervised learning with morphology-driven learning to improve the ability to identify complex structures. Our model, which analyzes the morphological representation of images, shows better results than the previous models in various medical imaging modalities, including CT, MRI, and lesion images.
comment: Accepted in BMVC 2024
♻ ☆ BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation ICRA 2023
Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.
comment: ICRA 2023. The first two authors contributed equally to this work. Project page: https://bevfusion.mit.edu
♻ ☆ PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation ECCV 2022
Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle things, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find the previous metric PartPQ biases to PQ. To handle both issues, we first design a meta-architecture that decouples part features and things/stuff features, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Second, we propose a new metric Part-Whole Quality (PWQ), better to measure this task from pixel-region and part-whole perspectives. It also decouples the errors for part segmentation and panoptic segmentation. Third, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme to boost part segmentation qualities further. We design a new part-whole interaction method using masked cross attention. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results. Our models can serve as a strong baseline and aid future research in PPS. The source code and trained models will be available at~\url{https://github.com/lxtGH/Panoptic-PartFormer}.
comment: T-PAMI-2024, Extension of PanopticPartFormer (ECCV 2022)
Information Retrieval
☆ A Counterfactual Explanation Framework for Retrieval Models
Explainability has become a crucial concern in today's world, aiming to enhance transparency in machine learning and deep learning models. Information retrieval is no exception to this trend. In existing literature on explainability of information retrieval, the emphasis has predominantly been on illustrating the concept of relevance concerning a retrieval model. The questions addressed include why a document is relevant to a query, why one document exhibits higher relevance than another, or why a specific set of documents is deemed relevant for a query. However, limited attention has been given to understanding why a particular document is considered non-relevant to a query with respect to a retrieval model. In an effort to address this gap, our work focus on the question of what terms need to be added within a document to improve its ranking. This in turn answers the question of which words played a role in not being favored by a retrieval model for a particular query. We use an optimization framework to solve the above-mentioned research problem. % To the best of our knowledge, we mark the first attempt to tackle this specific counterfactual problem. Our experiments show the effectiveness of our proposed approach in predicting counterfactuals for both statistical (e.g. BM25) and deep-learning-based models (e.g. DRMM, DSSM, ColBERT).
☆ Dissecting Temporal Understanding in Text-to-Audio Retrieval
Recent advancements in machine learning have fueled research on multimodal tasks, such as for instance text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects, and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of events. Code and data are available at https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/.
comment: 9 pages, 5 figures, ACM Multimedia 2024, https://www.robots.ox.ac.uk/~vgg/research/audio-retrieval/dtu/
☆ Building FKG.in: a Knowledge Graph for Indian Food
This paper presents an ontology design along with knowledge engineering, and multilingual semantic reasoning techniques to build an automated system for assimilating culinary information for Indian food in the form of a knowledge graph. The main focus is on designing intelligent methods to derive ontology designs and capture all-encompassing knowledge about food, recipes, ingredients, cooking characteristics, and most importantly, nutrition, at scale. We present our ongoing work in this workshop paper, describe in some detail the relevant challenges in curating knowledge of Indian food, and propose our high-level ontology design. We also present a novel workflow that uses AI, LLM, and language technology to curate information from recipe blog sites in the public domain to build knowledge graphs for Indian food. The methods for knowledge curation proposed in this paper are generic and can be replicated for any domain. The design is application-agnostic and can be used for AI-driven smart analysis, building recommendation systems for Personalized Digital Health, and complementing the knowledge graph for Indian food with contextual information such as user information, food biochemistry, geographic information, agricultural information, etc.
comment: 14 pages, 3 figures, 25 references, Formal Ontology in Information Systems Conference 2024 - Integrated Food Ontology Workshop
☆ Hound: Hunting Supervision Signals for Few and Zero Shot Node Classification on Text-attributed Graph
Text-attributed graph (TAG) is an important type of graph structured data with text descriptions for each node. Few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. However, the two tasks are challenging due to the lack of supervision signals, and existing methods only use the contrastive loss to align graph-based node embedding and language-based text embedding. In this paper, we propose Hound to improve accuracy by introducing more supervision signals, and the core idea is to go beyond the node-text pairs that come with data. Specifically, we design three augmentation techniques, i.e., node perturbation, text matching, and semantics negation to provide more reference nodes for each text and vice versa. Node perturbation adds/drops edges to produce diversified node embeddings that can be matched with a text. Text matching retrieves texts with similar embeddings to match with a node. Semantics negation uses a negative prompt to construct a negative text with the opposite semantics, which is contrasted with the original node and text. We evaluate Hound on 5 datasets and compare with 13 state-of-the-art baselines. The results show that Hound consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
☆ Fair Reciprocal Recommendation in Matching Markets RecSys2024
Recommender systems play an increasingly crucial role in shaping people's opportunities, particularly in online dating platforms. It is essential from the user's perspective to increase the probability of matching with a suitable partner while ensuring an appropriate level of fairness in the matching opportunities. We investigate reciprocal recommendation in two-sided matching markets between agents divided into two sides. In our model, a match is considered successful only when both individuals express interest in each other. Additionally, we assume that agents prefer to appear prominently in the recommendation lists presented to those on the other side. We define each agent's opportunity to be recommended and introduce its fairness criterion, envy-freeness, from the perspective of fair division theory. The recommendations that approximately maximize the expected number of matches, empirically obtained by heuristic algorithms, are likely to result in significant unfairness of opportunity. Therefore, there can be a trade-off between maximizing the expected matches and ensuring fairness of opportunity. To address this challenge, we propose a method to find a policy that is close to being envy-free by leveraging the Nash social welfare function. Experiments on synthetic and real-world datasets demonstrate the effectiveness of our approach in achieving both relatively high expected matches and fairness for opportunities of both sides in reciprocal recommender systems.
comment: Accepted at RecSys2024
☆ A Learnable Agent Collaboration Network Framework for Personalized Multimodal AI Search Engine
Large language models (LLMs) and retrieval-augmented generation (RAG) techniques have revolutionized traditional information access, enabling AI agent to search and summarize information on behalf of users during dynamic dialogues. Despite their potential, current AI search engines exhibit considerable room for improvement in several critical areas. These areas include the support for multimodal information, the delivery of personalized responses, the capability to logically answer complex questions, and the facilitation of more flexible interactions. This paper proposes a novel AI Search Engine framework called the Agent Collaboration Network (ACN). The ACN framework consists of multiple specialized agents working collaboratively, each with distinct roles such as Account Manager, Solution Strategist, Information Manager, and Content Creator. This framework integrates mechanisms for picture content understanding, user profile tracking, and online evolution, enhancing the AI search engine's response quality, personalization, and interactivity. A highlight of the ACN is the introduction of a Reflective Forward Optimization method (RFO), which supports the online synergistic adjustment among agents. This feature endows the ACN with online learning capabilities, ensuring that the system has strong interactive flexibility and can promptly adapt to user feedback. This learning method may also serve as an optimization approach for agent-based systems, potentially influencing other domains of agent applications.
comment: ACMMM 2024 MMGR WORKSHOP
♻ ☆ Dynamic Boundary Time Warping for Sub-sequence Matching with Few Examples
The paper presents a novel method of finding a fragment in a long temporal sequence similar to the set of shorter sequences. We are the first to propose an algorithm for such a search that does not rely on computing the average sequence from query examples. Instead, we use query examples as is, utilizing all of them simultaneously. The introduced method based on the Dynamic Time Warping (DTW) technique is suited explicitly for few-shot query-by-example retrieval tasks. We evaluate it on two different few-shot problems from the field of Natural Language Processing. The results show it either outperforms baselines and previous approaches or achieves comparable results when a low number of examples is available.
♻ ☆ Forecasting Live Chat Intent from Browsing History CIKM 2024
Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user's browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
comment: CIKM 2024
♻ ☆ Sparsity-regularized coded ptychography for robust and efficient lensless microscopy on a chip
Coded ptychography has emerged as a powerful technique for high-throughput, high-resolution lensless imaging. However, the trade-off between acquisition speed and image quality remains a significant challenge. To address this, we introduce a novel sparsity-regularized approach to coded ptychography that dramatically reduces the number of required measurements while maintaining high reconstruction quality. The reported approach, termed the ptychographic proximal total-variation (PPTV) solver, formulates the reconstruction task as a total variation regularized optimization problem. Unlike previous implementations that rely on specialized hardware or illumination schemes, PPTV integrates seamlessly into existing coded ptychography setups. Through comprehensive numerical simulations, we demonstrate that PPTV-driven coded ptychography can produce accurate reconstructions with as few as eight intensity measurements, a significant reduction compared to conventional methods. Convergence analysis confirms the robustness and stability of the PPTV algorithm. Experimental results from our optical prototype, featuring a disorder-engineered surface for wavefront modulation, validate PPTV's ability to achieve high-throughput, high-resolution imaging with a substantially reduced measurement burden. By enabling high-quality reconstructions from fewer measurements, PPTV paves the way for more compact, efficient, and cost-effective lensless microscopy systems on a chip, with potential applications in digital pathology, endoscopy, point-of-care diagnostics, and high-content screening.
comment: 14 pages, 7 figures
♻ ☆ Vague Preference Policy Learning for Conversational Recommendation
Conversational recommendation systems (CRS) commonly assume users have clear preferences, leading to potential over-filtering of relevant alternatives. However, users often exhibit vague, non-binary preferences. We introduce the Vague Preference Multi-round Conversational Recommendation (VPMCR) scenario, employing a soft estimation mechanism to accommodate users' vague and dynamic preferences while mitigating over-filtering. In VPMCR, we propose Vague Preference Policy Learning (VPPL), consisting of Ambiguity-aware Soft Estimation (ASE) and Dynamism-aware Policy Learning (DPL). ASE captures preference vagueness by estimating scores for clicked and non-clicked options, using a choice-based approach and time-aware preference decay. DPL leverages ASE's preference distribution to guide the conversation and adapt to preference changes for recommendations or attribute queries. Extensive experiments demonstrate VPPL's effectiveness within VPMCR, outperforming existing methods and setting a new benchmark. Our work advances CRS by accommodating users' inherent ambiguity and relative decision-making processes, improving real-world applicability.
Machine Learning
♻ ☆ Horseshoe-type Priors for Independent Component Estimation
Independent Component Estimation (ICE) has many applications in modern day machine learning as a feature engineering extraction method. Horseshoe-type priors are used to provide scalable algorithms that enables both point estimates via expectation-maximization (EM) and full posterior sampling via Markov Chain Monte Carlo (MCMC) algorithms. Our methodology also applies to flow-based methods for nonlinear feature extraction and deep learning. We also discuss how to implement conditional posteriors and envelope-based methods for optimization. Through this hierarchy representation, we unify a number of hitherto disparate estimation procedures. We illustrate our methodology and algorithms on a numerical example. Finally, we conclude with directions for future research.
comment: 23 pages, 2 figures
♻ ☆ Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture
The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of SER algorithms and develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested on real-world reverberant environments.
♻ ☆ Discovering Preference Optimization Algorithms with and for Large Language Models
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs. Typically, preference optimization is approached as an offline supervised learning task using manually-crafted convex loss functions. While these methods are based on theoretical insights, they are inherently constrained by human creativity, so the large search space of possible loss functions remains under explored. We address this by performing LLM-driven objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously-evaluated performance metrics. This process leads to the discovery of previously-unknown and performant preference optimization algorithms. The best performing of these we call Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks.
♻ ☆ REBEL: Reinforcement Learning via Regressing Relative Rewards
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and be extended to handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard.
comment: New experimental results on general chat
♻ ☆ MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
Large language models (LLM) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that the performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions consistent with quantifying LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help the LLM generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful "attacks" modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless "trick" the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how to use performance on a "MedFuzzed" benchmark, as well as individual successful attacks. The methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.
comment: 9 pages, 3 figures, 2 algorithms, appendix
♻ ☆ CardioLab: Laboratory Values Estimation from Electrocardiogram Features -- An Exploratory Study
Introduction: Laboratory value represents a cornerstone of medical diagnostics, but suffers from slow turnaround times, and high costs and only provides information about a single point in time. The continuous estimation of laboratory values from non-invasive data such as electrocardiogram (ECG) would therefore mark a significant frontier in healthcare monitoring. Despite its transformative potential, this domain remains relatively underexplored within the medical community. Methods: In this preliminary study, we used a publicly available dataset (MIMIC-IV-ECG) to investigate the feasibility of inferring laboratory values from ECG features and patient demographics using tree-based models (XGBoost). We define the prediction task as a binary prediction problem of predicting whether the lab value falls into low or high abnormalities. The model performance can then be assessed using AUROC. Results: Our findings demonstrate promising results in the estimation of laboratory values related to different organ systems based on a small yet comprehensive set of features. While further research and validation are warranted to fully assess the clinical utility and generalizability of ECG-based estimation in healthcare monitoring, our findings lay the groundwork for future investigations into approaches to laboratory value estimation using ECG data. Such advancements hold promise for revolutionizing predictive healthcare applications, offering faster, non-invasive, and more affordable means of patient monitoring.
comment: 4 pages, (updated dataset features set description version) code under https://github.com/AI4HealthUOL/CardioLab
♻ ☆ Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms ICLR 2024
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring functions, and is more robust on computationally predicted structures. Code is available at https://github.com/bjing2016/scalar-fields.
comment: ICLR 2024
♻ ☆ ApisTox: a new benchmark dataset for the classification of small molecules toxicity on honey bees
The global decline in bee populations poses significant risks to agriculture, biodiversity, and environmental stability. To bridge the gap in existing data, we introduce ApisTox, a comprehensive dataset focusing on the toxicity of pesticides to honey bees (Apis mellifera). This dataset combines and leverages data from existing sources such as ECOTOX and PPDB, providing an extensive, consistent, and curated collection that surpasses the previous datasets. ApisTox incorporates a wide array of data, including toxicity levels for chemicals, details such as time of their publication in literature, and identifiers linking them to external chemical databases. This dataset may serve as an important tool for environmental and agricultural research, but also can support the development of policies and practices aimed at minimizing harm to bee populations. Finally, ApisTox offers a unique resource for benchmarking molecular property prediction methods on agrochemical compounds, facilitating advancements in both environmental science and cheminformatics. This makes it a valuable tool for both academic research and practical applications in bee conservation.
♻ ☆ T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its security risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover fewer aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, a new benchmark designed for conducting safety-critical assessments of text-to-video models. We define 12 critical aspects of video generation safety and construct a malicious prompt dataset including real-world prompts, LLM-generated prompts and jailbreak attack-based prompts. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AI.
♻ ☆ Data-Aware Gradient Compression for DML in Communication-Constrained Mobile Computing
Distributed machine learning (DML) in mobile environments faces significant communication bottlenecks. Gradient compression has proven as an effective solution to this issue, offering substantial benefits in environments with limited bandwidth and metered data. Yet, it encounters severe performance drops in non-IID environments due to a one-size-fits-all compression approach, which does not account for the varying data volumes across workers. Assigning varying compression ratios to workers with distinct data distributions and volumes is therefore a promising solution. This work derives the convergence rate of distributed SGD with non-uniform compression, which reveals the intricate relationship between model convergence and the compression ratios applied to individual workers. Accordingly, we frame the relative compression ratio assignment as an $n$-variable chi-squared nonlinear optimization problem, constrained by a limited communication budget. We propose DAGC-R, which assigns conservative compression to workers handling larger data volumes. Recognizing the computational limitations of mobile devices, we propose the DAGC-A, which is computationally less demanding and enhances the robustness of compression in non-IID scenarios. Our experiments confirm that the DAGC-A and DAGC-R can speed up the training speed by up to $16.65\%$ and $25.43\%$ compared to the uniform compression respectively, when dealing with highly imbalanced data volume distribution and restricted communication.
♻ ☆ Creating Temporally Correlated High-Resolution Profiles of Load Injection Using Constrained Generative Adversarial Networks
Traditional smart meters, which measure energy usage every 15 minutes or more and report it at least a few hours later, lack the granularity needed for real-time decision-making. To address this practical problem, we introduce a new method using generative adversarial networks (GAN) that enforces temporal consistency on its high-resolution outputs via hard inequality constraints using convex optimization. A unique feature of our GAN model is that it is trained solely on slow timescale aggregated historical energy data obtained from smart meters. The results demonstrate that the model can successfully create minute-by-minute temporally correlated profiles of power usage from 15-minute interval average power consumption information. This innovative approach, emphasizing inter-neuron constraints, offers a promising avenue for improved high-speed state estimation in distribution systems and enhances the applicability of data-driven solutions for monitoring and subsequently controlling such systems.
comment: 6 pages
♻ ☆ Generalizing Graph Transformers Across Diverse Graphs and Tasks via Pre-Training on Industrial-Scale Data
Graph pre-training has been concentrated on graph-level on small graphs (e.g., molecular graphs) or learning node representations on a fixed graph. Extending graph pre-trained models to web-scale graphs with billions of nodes in industrial scenarios, while avoiding negative transfer across graphs or tasks, remains a challenge. We aim to develop a general graph pre-trained model with inductive ability that can make predictions for unseen new nodes and even new graphs. In this work, we introduce a scalable transformer-based graph pre-training framework called PGT (Pre-trained Graph Transformer). Specifically, we design a flexible and scalable graph transformer as the backbone network. Meanwhile, based on the masked autoencoder architecture, we design two pre-training tasks: one for reconstructing node features and the other one for reconstructing local structures. Unlike the original autoencoder architecture where the pre-trained decoder is discarded, we propose a novel strategy that utilizes the decoder for feature augmentation. We have deployed our framework on Tencent's online game data. Extensive experiments have demonstrated that our framework can perform pre-training on real-world web-scale graphs with over 540 million nodes and 12 billion edges and generalizes effectively to unseen new graphs with different downstream tasks. We further conduct experiments on the publicly available ogbn-papers100M dataset, which consists of 111 million nodes and 1.6 billion edges. Our framework achieves state-of-the-art performance on both industrial datasets and public datasets, while also enjoying scalability and efficiency.
comment: Work in progress
♻ ☆ Harmonizing Generalization and Personalization in Federated Prompt Learning
Federated Prompt Learning (FPL) incorporates large pre-trained Vision-Language models (VLM) into federated learning through prompt tuning. The transferable representations and remarkable generalization capacity of VLM make them highly compatible with the integration of federated learning. Addressing data heterogeneity in federated learning requires personalization, but excessive focus on it across clients could compromise the model's ability to generalize effectively. To preserve the impressive generalization capability of VLM, it is crucial to strike a balance between personalization and generalization in FPL. To tackle this challenge, we proposed Federated Prompt Learning with CLIP Generalization and low-rank Personalization (FedPGP), which employs pre-trained CLIP to provide knowledge-guidance on the global prompt for improved generalization and incorporates a low-rank adaptation term to personalize the global prompt. Further, FedPGP integrates a prompt-wise contrastive loss to achieve knowledge guidance and personalized adaptation simultaneously, enabling a harmonious balance between personalization and generalization in FPL. We conduct extensive experiments on various datasets to explore base-to-novel generalization in both category-level and domain-level scenarios with heterogeneous data, showing the superiority of FedPGP in balancing generalization and personalization.
♻ ☆ A Versatile Graph Learning Approach through LLM-based Agent
Designing versatile graph learning approaches is important, considering the diverse graphs and tasks existing in real-world applications. Existing methods have attempted to achieve this target through automated machine learning techniques, pre-training and fine-tuning strategies, and large language models. However, these methods are not versatile enough for graph learning, as they work on either limited types of graphs or a single task. In this paper, we propose to explore versatile graph learning approaches with LLM-based agents, and the key insight is customizing the graph learning procedures for diverse graphs and tasks. To achieve this, we develop several LLM-based agents, equipped with diverse profiles, tools, functions and human experience. They collaborate to configure each procedure with task and data-specific settings step by step towards versatile solutions, and the proposed method is dubbed GL-Agent. By evaluating on diverse tasks and graphs, the correct results of the agent and its comparable performance showcase the versatility of the proposed method, especially in complex scenarios.The low resource cost and the potential to use open-source LLMs highlight the efficiency of GL-Agent.
♻ ☆ MLRegTest: A Benchmark for the Machine Learning of Regular Languages
Synthetic datasets constructed from formal languages allow fine-grained examination of the learning and generalization capabilities of machine learning systems for sequence classification. This article presents a new benchmark for machine learning systems on sequence classification called MLRegTest, which contains training, development, and test sets from 1,800 regular languages. Different kinds of formal languages represent different kinds of long-distance dependencies, and correctly identifying long-distance dependencies in sequences is a known challenge for ML systems to generalize successfully. MLRegTest organizes its languages according to their logical complexity (monadic second order, first order, propositional, or monomial expressions) and the kind of logical literals (string, tier-string, subsequence, or combinations thereof). The logical complexity and choice of literal provides a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies. Finally, the performance of different neural networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The main conclusion is that performance depends significantly on the kind of test set, the class of language, and the neural network architecture.
comment: Accepted for publication in the Journal of Machine Learning Research. Dataset available at https://doi.org/10.5061/dryad.dncjsxm4h , code available at https://github.com/heinz-jeffrey/subregular-learning
♻ ☆ SFR-GNN: Simple and Fast Robust GNNs against Structural Attacks
Graph Neural Networks (GNNs) have demonstrated commendable performance for graph-structured data. Yet, GNNs are often vulnerable to adversarial structural attacks as embedding generation relies on graph topology. Existing efforts are dedicated to purifying the maliciously modified structure or applying adaptive aggregation, thereby enhancing the robustness against adversarial structural attacks. It is inevitable for a defender to consume heavy computational costs due to lacking prior knowledge about modified structures. To this end, we propose an efficient defense method, called Simple and Fast Robust Graph Neural Network (SFR-GNN), supported by mutual information theory. The SFR-GNN first pre-trains a GNN model using node attributes and then fine-tunes it over the modified graph in the manner of contrastive learning, which is free of purifying modified structures and adaptive aggregation, thus achieving great efficiency gains. Consequently, SFR-GNN exhibits a 24%--162% speedup compared to advanced robust models, demonstrating superior robustness for node classification tasks.
♻ ☆ Finite-dimensional approximations of push-forwards on locally analytic functionals
This paper introduces a novel theoretical framework for investigating analytic maps from finite discrete data. Our approach is to consider the push-forward on the space of locally analytic functionals, instead of directly handling the analytic map itself. We establish a methodology enabling appropriate finite-dimensional approximation of the push-forward from finite discrete data, through the theory of the Fourier--Borel transform and the Fock space. Moreover, we prove a rigorous convergence result with a convergence rate. As an application, we prove that it is not the least-squares polynomial, but the polynomial obtained by truncating its higher-degree terms, that approximates analytic functions and further allows for approximation beyond the support of the data distribution. One advantage of our theory is that it enables us to apply linear algebraic operations to the finite-dimensional approximation of the push-forward. Utilizing this, we prove the convergence of a method for approximating an analytic vector field from finite data of the flow map of an ordinary differential equation.
comment: 32 pages. 2 figures. We modified resutls. Comments are welcome
Multi-Fidelity Active Learning with GFlowNets
In the last decades, the capacity to generate large amounts of data in science and engineering applications has been growing steadily. Meanwhile, machine learning has progressed to become a suitable tool to process and utilise the available data. Nonetheless, many relevant scientific and engineering problems present challenges where current machine learning methods cannot yet efficiently leverage the available data and resources. For example, in scientific discovery, we are often faced with the problem of exploring very large, structured and high-dimensional spaces. Moreover, the high fidelity, black-box objective function is often very expensive to evaluate. Progress in machine learning methods that can efficiently tackle such challenges would help accelerate currently crucial areas such as drug and materials discovery. In this paper, we propose a multi-fidelity active learning algorithm with GFlowNets as a sampler, to efficiently discover diverse, high-scoring candidates where multiple approximations of the black-box function are available at lower fidelity and cost. Our evaluation on molecular discovery tasks shows that multi-fidelity active learning with GFlowNets can discover high-scoring candidates at a fraction of the budget of its single-fidelity counterpart while maintaining diversity, unlike RL-based alternatives. These results open new avenues for multi-fidelity active learning to accelerate scientific discovery and engineering design.
comment: Published in Transactions on Machine Learning Research (TMLR) 07/2024 https://openreview.net/forum?id=dLaazW9zuF
♻ ☆ Bridging the Sim-to-Real Gap with Bayesian Inference
We present SIM-FSVGD for learning robot dynamics from data. As opposed to traditional methods, SIM-FSVGD leverages low-fidelity physical priors, e.g., in the form of simulators, to regularize the training of neural network models. While learning accurate dynamics already in the low data regime, SIM-FSVGD scales and excels also when more data is available. We empirically show that learning with implicit physical priors results in accurate mean model estimation as well as precise uncertainty quantification. We demonstrate the effectiveness of SIM-FSVGD in bridging the sim-to-real gap on a high-performance RC racecar system. Using model-based RL, we demonstrate a highly dynamic parking maneuver with drifting, using less than half the data compared to the state of the art.
♻ ☆ Quantum Implicit Neural Representations
Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conducted experiments in signal representation, image superresolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.
comment: This paper was accepted by icml 2024
♻ ☆ Uncertainty Modeling in Graph Neural Networks via Stochastic Differential Equations
We address the problem of learning uncertainty-aware representations for graph-structured data. While Graph Neural Ordinary Differential Equations (GNODE) are effective in learning node representations, they fail to quantify uncertainty. To address this, we introduce Latent Graph Neural Stochastic Differential Equations (LGNSDE), which enhance GNODE by embedding randomness through Brownian motion to quantify uncertainty. We provide theoretical guarantees for LGNSDE and empirically show better performance in uncertainty quantification.
comment: 9 pages including appendix
♻ ☆ Downlink CCM Estimation via Representation Learning with Graph Regularization
In this paper, we propose an algorithm for downlink (DL) channel covariance matrix (CCM) estimation for frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) communication systems with base station (BS) possessing a uniform linear array (ULA) antenna structure. We consider a setting where the UL CCM is mapped to DL CCM by a mapping function. We first present a theoretical error analysis of learning a nonlinear embedding by constructing a mapping function, which points to the importance of the Lipschitz regularity of the mapping function for achieving high estimation performance. Then, based on the theoretical ground, we propose a representation learning algorithm as a solution for the estimation problem, where Gaussian RBF kernel interpolators are chosen to map UL CCMs to their DL counterparts. The proposed algorithm is based on the optimization of an objective function that fits a regression model between the DL CCM and UL CCM samples in the training dataset and preserves the local geometric structure of the data in the UL CCM space, while explicitly regulating the Lipschitz continuity of the mapping function in light of our theoretical findings. The proposed algorithm surpasses benchmark methods in terms of three error metrics as shown by simulations.
♻ ☆ Editing Personality for Large Language Models NLPCC 2024
This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs). This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. Specifically, we construct PersonalityEdit, a new benchmark dataset to address this task. Drawing on the theory in Social Psychology, we isolate three representative traits, namely Neuroticism, Extraversion, and Agreeableness, as the foundation for our benchmark. We then gather data using GPT-4, generating responses that align with a specified topic and embody the targeted personality trait. We conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in LLMs. Our findings uncover potential challenges of the proposed task, illustrating several remaining issues. We anticipate that our work can stimulate further annotation in model editing and personality-related research. Code is available at https://github.com/zjunlp/EasyEdit.
comment: NLPCC 2024
♻ ☆ UniHPF : Universal Healthcare Predictive Framework with Zero Domain Knowledge ML4H
Despite the abundance of Electronic Healthcare Records (EHR), its heterogeneity restricts the utilization of medical data in building predictive models. To address this challenge, we propose Universal Healthcare Predictive Framework (UniHPF), which requires no medical domain knowledge and minimal pre-processing for multiple prediction tasks. Experimental results demonstrate that UniHPF is capable of building large-scale EHR models that can process any form of medical data from distinct EHR systems. We believe that our findings can provide helpful insights for further research on the multi-source learning of EHRs.
comment: The original paper is published on Journal of Biomedical and Health Informatics(JBHI) 2023, https://ieeexplore.ieee.org/document/10298642. Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2022, November 28th, 2022, New Orleans, United States, 19 pages(main paper 6 pages). arXiv admin note: substantial text overlap with arXiv:2207.09858
♻ ☆ Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships
Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. This strategy differs from prior studies by aiming to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose the use of pseudo-labels for these generated texts, leveraging membership approximations indicated by machine-generated probabilities from the target LM. We subsequently fine-tune the LM to favor generations with higher likelihoods of originating from the pre-training data, based on their membership probabilities. Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
comment: 20 pages, 6 figures, 15 tables
♻ ☆ AirPilot: Interpretable PPO-based DRL Auto-Tuned Nonlinear PID Drone Controller for Robust Autonomous Flights
Navigation precision, speed and stability are crucial for safe Unmanned Aerial Vehicle (UAV) flight maneuvers and effective flight mission executions in dynamic environments. Different flight missions may have varying objectives, such as minimizing energy consumption, achieving precise positioning, or maximizing speed. A controller that can adapt to different objectives on the fly is highly valuable. Proportional Integral Derivative (PID) controllers are one of the most popular and widely used control algorithms for drones and other control systems, but their linear control algorithm fails to capture the nonlinear nature of the dynamic wind conditions and complex drone system. Manually tuning the PID gains for various missions can be time-consuming and requires significant expertise. This paper aims to revolutionize drone flight control by presenting the AirPilot, a nonlinear Deep Reinforcement Learning (DRL) - enhanced Proportional Integral Derivative (PID) drone controller using Proximal Policy Optimization (PPO). AirPilot controller combines the simplicity and effectiveness of traditional PID control with the adaptability, learning capability, and optimization potential of DRL. This makes it better suited for modern drone applications where the environment is dynamic, and mission-specific performance demands are high. We employed a COEX Clover autonomous drone for training the DRL agent within the simulator and implemented it in a real-world lab setting, which marks a significant milestone as one of the first attempts to apply a DRL-based flight controller on an actual drone. Airpilot is capable of reducing the navigation error of the default PX4 PID position controller by 90%, improving effective navigation speed of a fine-tuned PID controller by 21%, reducing settling time and overshoot by 17% and 16% respectively.
comment: 9 pages, 20 figures
♻ ☆ The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
♻ ☆ Enhancing Training Efficiency Using Packing with Flash Attention
Padding is often used in tuning LLM models by adding special tokens to shorter training examples to match the length of the longest sequence in each batch. While this ensures uniformity for batch processing, it introduces inefficiencies by including irrelevant padding tokens in the computation and wastes GPU resources. Hugging Face SFT trainer has always offered the option to use packing to combine multiple training examples, allowing for maximal utilization of GPU resources. However, up till now, it did not offer proper masking of each packed training example. This capability has been added to Hugging Face Transformers 4.44. We analyse this new feature and show the benefits across different variations of packing.
Multimedia
☆ MetaDigiHuman: Haptic Interfaces for Digital Humans in Metaverse
The way we engage with digital spaces and the digital world has undergone rapid changes in recent years, largely due to the emergence of the Metaverse. As technology continues to advance, the demand for sophisticated and immersive interfaces to interact with the Metaverse has become increasingly crucial. Haptic interfaces have been developed to meet this need and provide users with tactile feedback and realistic touch sensations. These interfaces play a vital role in creating a more authentic and immersive experience within the Metaverse. This article introduces the concept of MetaDigiHuman, a groundbreaking framework that combines blended digital humans and haptic interfaces. By harnessing cutting-edge technologies, MetaDigiHuman enables seamless and immersive interaction within the Metaverse. Through this framework, users can simulate the sensation of touching, feeling, and interacting with digital beings as if they were physically present in the environments, offering a more compelling and immersive experience within the Metaverse.
comment: 5 pages, 2 figures
☆ Multimodal Multi-turn Conversation Stance Detection: A Challenge Dataset and Effective Model ACM MM2024
Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content including text, and images multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD), that learns joint stance representations from textual and visual modalities. Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research.
comment: ACM MM2024
♻ ☆ Harmonizing Attention: Training-free Texture-aware Geometry Transfer WACV2025
Extracting geometry features from photographic images independently of surface texture and transferring them onto different materials remains a complex challenge. In this study, we introduce Harmonizing Attention, a novel training-free approach that leverages diffusion models for texture-aware geometry transfer. Our method employs a simple yet effective modification of self-attention layers, allowing the model to query information from multiple reference images within these layers. This mechanism is seamlessly integrated into the inversion process as Texture-aligning Attention and into the generation process as Geometry-aligning Attention. This dual-attention approach ensures the effective capture and transfer of material-independent geometry features while maintaining material-specific textural continuity, all without the need for model fine-tuning.
comment: Accepted at WACV2025
Computation and Language
♻ ☆ Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparatively to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominent in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.
♻ ☆ Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex structures and operations often pose challenges for non-experts to grasp. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex structure with explanations of the underlying operations. By comparing image generation of prompt variants, users can discover the impact of keyword changes on image generation. A 56-participant user study demonstrates that Diffusion Explainer offers substantial learning benefits to non-experts. Our tool has been used by over 10,300 users from 124 countries at https://poloclub.github.io/diffusion-explainer/.
comment: 5 pages, 7 figures
♻ ☆ The Foundational Capabilities of Large Language Models in Predicting Postoperative Risks Using Clinical Notes
Clinical notes recorded during a patient's perioperative journey holds immense informational value. Advances in large language models (LLMs) offer opportunities for bridging this gap. Using 84,875 pre-operative notes and its associated surgical cases from 2018 to 2021, we examine the performance of LLMs in predicting six postoperative risks using various fine-tuning strategies. Pretrained LLMs outperformed traditional word embeddings by an absolute AUROC of 38.3% and AUPRC of 33.2%. Self-supervised fine-tuning further improved performance by 3.2% and 1.5%. Incorporating labels into training further increased AUROC by 1.8% and AUPRC by 2%. The highest performance was achieved with a unified foundation model, with improvements of 3.6% for AUROC and 2.6% for AUPRC compared to self-supervision, highlighting the foundational capabilities of LLMs in predicting postoperative risks, which could be potentially beneficial when deployed for perioperative care
comment: Codes are publicly available at: https://github.com/cja5553/LLMs_in_perioperative_care
♻ ☆ A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond
Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative impacts on the whole society. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both research communities over the past few years. This survey presents a systematic and chronological review of the advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models). Concurrently, we highlight the major technical transitions in models, tasks, and evaluations spanning through different stages. For applications, we also observe a co-evolving shift. It spans from initial endeavors to tackling specific scenarios, through exploring a diverse array of tasks during its rapid expansion, to currently focusing on tackling increasingly complex and varied real-world challenges. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into both the opportunities and challenges associated with this field, alongside elucidating our insights on the most promising research directions. An ongoing, dynamically updated project and resources associated with this survey have been released at https://github.com/QiushiSun/NCISurvey.
comment: 64 pages, 6 figures, 10 tables, 695 references
♻ ☆ Characterizing Online Toxicity During the 2022 Mpox Outbreak: A Computational Analysis of Topical and Network Dynamics
Background: Online toxicity, encompassing behaviors such as harassment, bullying, hate speech, and the dissemination of misinformation, has become a pressing social concern in the digital age. The 2022 Mpox outbreak, initially termed "Monkeypox" but subsequently renamed to mitigate associated stigmas and societal concerns, serves as a poignant backdrop to this issue. Objective: In this research, we undertake a comprehensive analysis of the toxic online discourse surrounding the 2022 Mpox outbreak. Our objective is to dissect its origins, characterize its nature and content, trace its dissemination patterns, and assess its broader societal implications, with the goal of providing insights that can inform strategies to mitigate such toxicity in future crises. Methods: We collected more than 1.6 million unique tweets and analyzed them from five dimensions, including context, extent, content, speaker, and intent. Utilizing BERT-based topic modeling and social network community clustering, we delineated the toxic dynamics on Twitter. Results: We identified five high-level topic categories in the toxic online discourse on Twitter, including disease (46.6%), health policy and healthcare (19.3%), homophobia (23.9%), politics (6.0%), and racism (4.1%). Through the toxicity diffusion networks of mentions, retweets, and the top users, we found that retweets of toxic content were widespread, while influential users rarely engaged with or countered this toxicity through retweets. Conclusions: By tracking topical dynamics, we can track the changing popularity of toxic content online, providing a better understanding of societal challenges. Network dynamics spotlight key social media influencers and their intents, indicating that addressing these central figures in toxic discourse can enhance crisis communication and inform policy-making.
comment: 36 pages, 8 figure, and 12 tables
♻ ☆ Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing
Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). While prior studies have primarily focused on the post-generation analysis and refinement of outputs, this paper centers on the effectiveness of queries in eliciting accurate responses from LLMs. We present HalluciBot, a model that estimates the query's propensity to hallucinate before generation, without invoking any LLMs during inference. HalluciBot can serve as a proxy reward model for query rewriting, offering a general framework to estimate query quality based on accuracy and consensus. In essence, HalluciBot investigates how poorly constructed queries can lead to erroneous outputs - moreover, by employing query rewriting guided by HalluciBot's empirical estimates, we demonstrate that 95.7% output accuracy can be achieved for Multiple Choice questions. The training procedure for HalluciBot consists of perturbing 369,837 queries n times, employing n+1 independent LLM agents, sampling an output from each query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs, and training an encoder classifier. The idea of perturbation is the outcome of our ablation studies that measures the increase in output diversity (+12.5 agreement spread) by perturbing a query in lexically different but semantically similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1 score, 46.6% in saved computation on hallucinatory queries), rewrite (+30.2% positive class transition from hallucinatory to non-hallucinatory), rank (+50.6% positive class transition from hallucinatory to non-hallucinatory), and route queries to effective pipelines.
♻ ☆ Evaluating Large Language Models for Health-related Queries with Presuppositions ACL 2024
As corporations rush to integrate large language models (LLMs) to their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT 26% and BingChat 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy, and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.
comment: Findings of ACL 2024
♻ ☆ Towards a theory of how the structure of language is acquired by deep neural networks
How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG) -- a tree-like generative model that captures many of the hierarchical structures found in natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables, the longer the range the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a Language Model trained with increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem. We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets. In particular, our conjecture predicts how the scaling law for the test loss behaviour with training set size depends on the length of the context window, which we confirm empirically in Shakespeare's plays and Wikipedia articles.
comment: 9 pages, 4 figures (main)
♻ ☆ Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition
With the release of increasing open-source emotion recognition datasets on social media platforms and the rapid development of computing resources, multimodal emotion recognition tasks (MER) have begun to receive widespread research attention. The MER task extracts and fuses complementary semantic information from different modalities, which can classify the speaker's emotions. However, the existing feature fusion methods have usually mapped the features of different modalities into the same feature space for information fusion, which can not eliminate the heterogeneity between different modalities. Therefore, it is challenging to make the subsequent emotion class boundary learning. To tackle the above problems, we have proposed a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive for Multimodal Emotion Recognition (AR-IIGCN) method. Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for the three modal features through adversarial representation, which can achieve information interaction between modalities and eliminate heterogeneity among modalities. Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information and learn intra-class and inter-class boundary information of emotion categories. Specifically, we construct a graph structure for three modal features and perform contrastive representation learning on nodes with different emotions in the same modality and the same emotion in different modalities, which can improve the feature representation ability of nodes. Extensive experimental works show that the ARL-IIGCN method can significantly improve emotion recognition accuracy on IEMOCAP and MELD datasets.
comment: 14 pages, 6 figures
♻ ☆ Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations
The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52\% and 35\%, respectively.
comment: 11 pages, 3 tables
♻ ☆ DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition
With the continuous development of deep learning (DL), the task of multimodal dialogue emotion recognition (MDER) has recently received extensive research attention, which is also an essential branch of DL. The MDER aims to identify the emotional information contained in different modalities, e.g., text, video, and audio, in different dialogue scenes. However, existing research has focused on modeling contextual semantic information and dialogue relations between speakers while ignoring the impact of event relations on emotion. To tackle the above issues, we propose a novel Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN) method. It models dialogue relations between speakers and captures latent event relations information. Specifically, we construct a weighted multi-relationship graph to simultaneously capture the dependencies between speakers and event relations in a dialogue. Moreover, we also introduce a Self-Supervised Masked Graph Autoencoder (SMGAE) to improve the fusion representation ability of features and structures. Next, we design a new Multiple Information Transformer (MIT) to capture the correlation between different relations, which can provide a better fuse of the multivariate information between relations. Finally, we propose a loss optimization strategy based on contrastive learning to enhance the representation learning ability of minority class features. We conduct extensive experiments on the IEMOCAP and MELD benchmark datasets, which verify the effectiveness of the DER-GCN model. The results demonstrate that our model significantly improves both the average accuracy and the f1 value of emotion recognition.
comment: 14 pages, 7 figures
♻ ☆ Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models ACL 2024
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from a LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.
comment: Accepted at ACL 2024
♻ ☆ A Novel ICD Coding Method Based on Associated and Hierarchical Code Description Distillation
ICD(International Classification of Diseases) coding involves assigning ICD codes to patients visit based on their medical notes. ICD coding is a challenging multilabel text classification problem due to noisy medical document inputs. Recent advancements in automated ICD coding have enhanced performance by integrating additional data and knowledge bases with the encoding of medical notes and codes. However, most of them ignore the code hierarchy, leading to improper code assignments. To address these problems, we propose a novel framework based on associated and hierarchical code description distillation (AHDD) for better code representation learning and avoidance of improper code assignment.we utilize the code description and the hierarchical structure inherent to the ICD codes. Therefore, in this paper, we leverage the code description and the hierarchical structure inherent to the ICD codes. The code description is also applied to aware the attention layer and output layer. Experimental results on the benchmark dataset show the superiority of the proposed framework over several state-of-the-art baselines.
♻ ☆ FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction ACL 2024
Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.
comment: ACL 2024 camera ready. Code and data at https://github.com/Babelscape/FENICE
♻ ☆ ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models ACL 2024
In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
comment: Accepted in ACL 2024
Computer Vision and Pattern Recognition
♻ ☆ Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparatively to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominent in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.
♻ ☆ CURLing the Dream: Contrastive Representations for World Modeling in Reinforcement Learning
In this work, we present Curled-Dreamer, a novel reinforcement learning algorithm that integrates contrastive learning into the DreamerV3 framework to enhance performance in visual reinforcement learning tasks. By incorporating the contrastive loss from the CURL algorithm and a reconstruction loss from autoencoder, Curled-Dreamer achieves significant improvements in various DeepMind Control Suite tasks. Our extensive experiments demonstrate that Curled-Dreamer consistently outperforms state-of-the-art algorithms, achieving higher mean and median scores across a diverse set of tasks. The results indicate that the proposed approach not only accelerates learning but also enhances the robustness of the learned policies. This work highlights the potential of combining different learning paradigms to achieve superior performance in reinforcement learning applications.
comment: Paper accepted for 24th International Conference on Control, Automation and Systems (ICCAS)
♻ ☆ Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification
This paper presents dilated Residual Network (ResNet) models for disease classification from retinal fundus images. Dilated convolution filters are used to replace normal convolution filters in the higher layers of the ResNet model (dilated ResNet) in order to improve the receptive field compared to the normal ResNet model for disease classification. This study introduces computer-assisted diagnostic tools that employ deep learning, enhanced with explainable AI techniques. These techniques aim to make the tool's decision-making process transparent, thereby enabling medical professionals to understand and trust the AI's diagnostic decision. They are particularly relevant in today's healthcare landscape, where there is a growing demand for transparency in AI applications to ensure their reliability and ethical use. The dilated ResNet is used as a replacement for the normal ResNet to enhance the classification accuracy of retinal eye diseases and reduce the required computing time. The dataset used in this work is the Ocular Disease Intelligent Recognition (ODIR) dataset which is a structured ophthalmic database with eight classes covering most of the common retinal eye diseases. The evaluation metrics used in this work include precision, recall, accuracy, and F1 score. In this work, a comparative study has been made between normal ResNet models and dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67, and 0.70 respectively for the above respective variants in ODIR multiclass disease classification.
comment: Added authors' contributions
♻ ☆ Convex Hull Prediction for Adaptive Video Streaming by Recurrent Learning
Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step is equivalent to an exhaustive search process over the space of possible encoding parameters, which causes significant overhead in terms of both computation and time expenditure. To reduce this overhead, we propose a deep learning based method of content aware convex hull prediction. We employ a recurrent convolutional network (RCN) to implicitly analyze the spatiotemporal complexity of video shots in order to predict their convex hulls. A two-step transfer learning scheme is adopted to train our proposed RCN-Hull model, which ensures sufficient content diversity to analyze scene complexity, while also making it possible to capture the scene statistics of pristine source videos. Our experimental results reveal that our proposed model yields better approximations of the optimal convex hulls, and offers competitive time savings as compared to existing approaches. On average, the pre-encoding time was reduced by 53.8% by our method, while the average Bjontegaard delta bitrate (BD-rate) of the predicted convex hulls against ground truth was 0.26%, and the mean absolute deviation of the BD-rate distribution was 0.57%.
♻ ☆ Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations
Due to the successful development of deep image generation technology, forgery detection plays a more important role in social and economic security. Racial bias has not been explored thoroughly in the deep forgery detection field. In the paper, we first contribute a dedicated dataset called the Fair Forgery Detection (FairFD) dataset, where we prove the racial bias of public state-of-the-art (SOTA) methods. Different from existing forgery detection datasets, the self-constructed FairFD dataset contains a balanced racial ratio and diverse forgery generation images with the largest-scale subjects. Additionally, we identify the problems with naive fairness metrics when benchmarking forgery detection models. To comprehensively evaluate fairness, we design novel metrics including Approach Averaged Metric and Utility Regularized Metric, which can avoid deceptive results. We also present an effective and robust post-processing technique, Bias Pruning with Fair Activations (BPFA), which improves fairness without requiring retraining or weight updates. Extensive experiments conducted with 12 representative forgery detection models demonstrate the value of the proposed dataset and the reasonability of the designed fairness metrics. By applying the BPFA to the existing fairest detector, we achieve a new SOTA. Furthermore, we conduct more in-depth analyses to offer more insights to inspire researchers in the community.
♻ ☆ S3E: A Mulit-Robot Multimodal Dataset for Collaborative SLAM
The burgeoning demand for collaborative robotic systems to execute complex tasks collectively has intensified the research community's focus on advancing simultaneous localization and mapping (SLAM) in a cooperative context. Despite this interest, the scalability and diversity of existing datasets for collaborative trajectories remain limited, especially in scenarios with constrained perspectives where the generalization capabilities of Collaborative SLAM (C-SLAM) are critical for the feasibility of multi-agent missions. Addressing this gap, we introduce S3E, an expansive multimodal dataset. Captured by a fleet of unmanned ground vehicles traversing four distinct collaborative trajectory paradigms, S3E encompasses 13 outdoor and 5 indoor sequences. These sequences feature meticulously synchronized and spatially calibrated data streams, including 360-degree LiDAR point cloud, high-resolution stereo imagery, high-frequency inertial measurement units (IMU), and Ultra-wideband (UWB) relative observations. Our dataset not only surpasses previous efforts in scale, scene diversity, and data intricacy but also provides a thorough analysis and benchmarks for both collaborative and individual SLAM methodologies. For access to the dataset and the latest information, please visit our repository at https://pengyu-team.github.io/S3E.
♻ ☆ CILF-CIAE: CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation
The age estimation task aims to predict the age of an individual by analyzing facial features in an image. The development of age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control, etc.). In recent years, contrastive language-image pre-training (CLIP) has been widely used in various multimodal tasks and has made some progress in the field of age estimation. However, existing CLIP-based age estimation methods require high memory usage (quadratic complexity) when globally modeling images, and lack an error feedback mechanism to prompt the model about the quality of age prediction results. To tackle the above issues, we propose a novel CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE). Specifically, we first introduce the CLIP model to extract image features and text semantic information respectively, and map them into a highly semantically aligned high-dimensional feature space. Next, we designed a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images, and to fuse image and text semantic information. Compared with the quadratic complexity of the attention mechanism, the proposed Fourierformer is of linear log complexity. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through contrastive loss for image-text matching, thereby improving the interaction effect between different modalities. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Through extensive experiments on multiple data sets, CILF-CIAE has achieved better age prediction results.
comment: 14 pages, 14 figures, 3 tables
♻ ☆ Graph Information Bottleneck for Remote Sensing Segmentation
Remote sensing segmentation has a wide range of applications in environmental protection, and urban change detection, etc. Despite the success of deep learning-based remote sensing segmentation methods (e.g., CNN and Transformer), they are not flexible enough to model irregular objects. In addition, existing graph contrastive learning methods usually adopt the way of maximizing mutual information to keep the node representations consistent between different graph views, which may cause the model to learn task-independent redundant information. To tackle the above problems, this paper treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture for remote sensing segmentation. Specifically, we construct a node-masked and edge-masked graph view to obtain an optimal graph structure representation, which can adaptively learn whether to mask nodes and edges. Furthermore, this paper innovatively introduces information bottleneck theory into graph contrastive learning to maximize task-related information while minimizing task-independent redundant information. Finally, we replace the convolutional module in UNet with the SC-ViG module to complete the segmentation and classification tasks of remote sensing images. Extensive experiments on publicly available real datasets demonstrate that our method outperforms state-of-the-art remote sensing image segmentation methods.
comment: 13 pages, 6 figures
♻ ☆ Palantir: Towards Efficient Super Resolution for Ultra-high-definition Live Streaming
Neural enhancement through super-resolution (SR) deep neural networks (DNNs) opens up new possibilities for ultra-high-definition (UHD) live streaming over existing encoding and networking infrastructure. Yet, the heavy SR DNN inference overhead leads to severe deployment challenges. To reduce the overhead, existing systems propose to apply DNN-based SR only on carefully selected anchor frames while upscaling non-anchor frames via the lightweight reusing-based SR approach. However, frame-level scheduling is coarse-grained and fails to deliver optimal efficiency. In this work, we propose Palantir, the first neural-enhanced UHD live streaming system with fine-grained patch-level scheduling. Two novel techniques are incorporated into Palantir to select the most beneficial anchor patches and support latency-sensitive UHD live streaming applications. Firstly, under the guidance of our pioneering and theoretical analysis, Palantir constructs a directed acyclic graph (DAG) for lightweight yet accurate SR quality estimation under any possible anchor patch set. Secondly, to further optimize the scheduling latency, Palantir improves parallelizability by refactoring the computation subprocedure of the estimation process into a sparse matrix-matrix multiplication operation. The evaluation results suggest that Palantir incurs a negligible scheduling latency accounting for less than 5.7% of the end-to-end latency requirement. When compared to the naive method of applying DNN-based SR on all the frames, Palantir can reduce the SR DNN inference overhead by 20 times (or 60 times) while preserving 54.0-82.6% (or 32.8-64.0%) of the quality gain. When compared to the state-of-the-art real-time frame-level scheduling strategy, Palantir can reduce the SR DNN inference overhead by 80.1% at most (and 38.4% on average) without sacrificing the video quality.
♻ ☆ Anticipating Future Object Compositions without Forgetting
Despite the significant advancements in computer vision models, their ability to generalize to novel object-attribute compositions remains limited. Existing methods for Compositional Zero-Shot Learning (CZSL) mainly focus on image classification. This paper aims to enhance CZSL in object detection without forgetting prior learned knowledge. We use Grounding DINO and incorporate Compositional Soft Prompting (CSP) into it and extend it with Compositional Anticipation. We achieve a 70.5% improvement over CSP on the harmonic mean (HM) between seen and unseen compositions on the CLEVR dataset. Furthermore, we introduce Contrastive Prompt Tuning to incrementally address model confusion between similar compositions. We demonstrate the effectiveness of this method and achieve an increase of 14.5% in HM across the pretrain, increment, and unseen sets. Collectively, these methods provide a framework for learning various compositions with limited data, as well as improving the performance of underperforming compositions when additional data becomes available.
♻ ☆ InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding
Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench a comprehensive benchmark for very long video understanding which presents 1)The longest video duration, averaging 52.59 minutes per video 2) The largest number of question-answer pairs, 108.2K 3) Diversity in questions that examine nine different skills and include both multiple-choice questions and open-ended questions 4) Human-centric, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including the commercial models such as GPT-4o and Gemini 1.5 Flash and the open-source models. The evaluation shows significant challenges in our benchmark. Our findings reveal that even leading AI models like GPT-4o and Gemini 1.5 Flash face challenges in achieving high performance in long video understanding, with average accuracies of just 49.16\% and 42.72\%, and average scores of 3.22 and 2.71 out of 5, respectively. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/
comment: 24 pages,25 figures
♻ ☆ Translating Images to Road Network: A Sequence-to-Sequence Perspective ICCV 2023
The extraction of road network is essential for the generation of high-definition maps since it enables the precise localization of road landmarks and their interconnections. However, generating road network poses a significant challenge due to the conflicting underlying combination of Euclidean (e.g., road landmarks location) and non-Euclidean (e.g., road topological connectivity) structures. Existing methods struggle to merge the two types of data domains effectively, but few of them address it properly. Instead, our work establishes a unified representation of both types of data domain by projecting both Euclidean and non-Euclidean data into an integer series called RoadNet Sequence. Further than modeling an auto-regressive sequence-to-sequence Transformer model to understand RoadNet Sequence, we decouple the dependency of RoadNet Sequence into a mixture of auto-regressive and non-autoregressive dependency. Building on this, our proposed non-autoregressive sequence-to-sequence approach leverages non-autoregressive dependencies while fixing the gap towards auto-regressive dependencies, resulting in success on both efficiency and accuracy. We further identify two main bottlenecks in the current RoadNetTransformer on a non-overfitting split of the dataset: poor landmark detection limited by the BEV Encoder and error propagation to topology reasoning. Therefore, we propose Topology-Inherited Training to inherit better topology knowledge into RoadNetTransformer. Additionally, we collect SD-Maps from open-source map datasets and use this prior information to significantly improve landmark detection and reachability. Extensive experiments on nuScenes dataset demonstrate the superiority of RoadNet Sequence representation and the non-autoregressive approach compared to existing state-of-the-art alternatives.
comment: V1 is the ICCV 2023 conference version, and V2 is the extended version
♻ ☆ Geometric Prior Guided Feature Representation Learning for Long-Tailed Classification
Real-world data are long-tailed, the lack of tail samples leads to a significant limitation in the generalization ability of the model. Although numerous approaches of class re-balancing perform well for moderate class imbalance problems, additional knowledge needs to be introduced to help the tail class recover the underlying true distribution when the observed distribution from a few tail samples does not represent its true distribution properly, thus allowing the model to learn valuable information outside the observed domain. In this work, we propose to leverage the geometric information of the feature distribution of the well-represented head class to guide the model to learn the underlying distribution of the tail class. Specifically, we first systematically define the geometry of the feature distribution and the similarity measures between the geometries, and discover four phenomena regarding the relationship between the geometries of different feature distributions. Then, based on four phenomena, feature uncertainty representation is proposed to perturb the tail features by utilizing the geometry of the head class feature distribution. It aims to make the perturbed features cover the underlying distribution of the tail class as much as possible, thus improving the model's generalization performance in the test domain. Finally, we design a three-stage training scheme enabling feature uncertainty modeling to be successfully applied. Experiments on CIFAR-10/100-LT, ImageNet-LT, and iNaturalist2018 show that our proposed approach outperforms other similar methods on most metrics. In addition, the experimental phenomena we discovered are able to provide new perspectives and theoretical foundations for subsequent studies.
comment: This work was accepted by the IJCV 2024
♻ ☆ xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations ECCV24
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.
comment: Accepted by ECCV24 AI4VA
♻ ☆ CMAB: A First National-Scale Multi-Attribute Building Dataset in China Derived from Open Source Data and GeoAI
Rapidly acquiring three-dimensional (3D) building data, including geometric attributes like rooftop, height and orientations, as well as indicative attributes like function, quality, and age, is essential for accurate urban analysis, simulations, and policy updates. Current building datasets suffer from incomplete coverage of building multi-attributes. This paper introduces a geospatial artificial intelligence (GeoAI) framework for large-scale building modeling, presenting the first national-scale Multi-Attribute Building dataset (CMAB), covering 3,667 spatial cities, 29 million buildings, and 21.3 billion square meters of rooftops with an F1-Score of 89.93% in OCRNet-based extraction, totaling 337.7 billion cubic meters of building stock. We trained bootstrap aggregated XGBoost models with city administrative classifications, incorporating features such as morphology, location, and function. Using multi-source data, including billions of high-resolution Google Earth images and 60 million street view images (SVIs), we generated rooftop, height, function, age, and quality attributes for each building. Accuracy was validated through model benchmarks, existing similar products, and manual SVI validation, mostly above 80%. Our dataset and results are crucial for global SDGs and urban planning.
comment: 43 pages, 20 figures
♻ ☆ Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection WACV
Skeleton-based video anomaly detection (SVAD) is a crucial task in computer vision. Accurately identifying abnormal patterns or events enables operators to promptly detect suspicious activities, thereby enhancing safety. Achieving this demands a comprehensive understanding of human motions, both at body and region levels, while also accounting for the wide variations of performing a single action. However, existing studies fail to simultaneously address these crucial properties. This paper introduces a novel, practical and lightweight framework, namely Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection (GiCiSAD) to overcome the challenges associated with SVAD. GiCiSAD consists of three novel modules: the Graph Attention-based Forecasting module to capture the spatio-temporal dependencies inherent in the data, the Graph-level Jigsaw Puzzle Maker module to distinguish subtle region-level discrepancies between normal and abnormal motions, and the Graph-based Conditional Diffusion model to generate a wide spectrum of human motions. Extensive experiments on four widely used skeleton-based video datasets show that GiCiSAD outperforms existing methods with significantly fewer training parameters, establishing it as the new state-of-the-art.
comment: Accepted at the Winter Conference on Applications of Computer Vision (WACV). 17 pages, 6 figures, 6 tables
♻ ☆ A Survey for Foundation Models in Autonomous Driving
The advent of foundation models has revolutionized the fields of natural language processing and computer vision, paving the way for their application in autonomous driving (AD). This survey presents a comprehensive review of more than 40 research papers, demonstrating the role of foundation models in enhancing AD. Large language models contribute to planning and simulation in AD, particularly through their proficiency in reasoning, code generation and translation. In parallel, vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking, as well as creating realistic driving scenarios for simulation and testing. Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning, crucial for end-to-end AD. This survey not only provides a structured taxonomy, categorizing foundation models based on their modalities and functionalities within the AD domain but also delves into the methods employed in current research. It identifies the gaps between existing foundation models and cutting-edge AD approaches, thereby charting future research directions and proposing a roadmap for bridging these gaps.
Information Retrieval
☆ PSLF: A PID Controller-incorporated Second-order Latent Factor Analysis Model for Recommender System
A second-order-based latent factor (SLF) analysis model demonstrates superior performance in graph representation learning, particularly for high-dimensional and incomplete (HDI) interaction data, by incorporating the curvature information of the loss landscape. However, its objective function is commonly bi-linear and non-convex, causing the SLF model to suffer from a low convergence rate. To address this issue, this paper proposes a PID controller-incorporated SLF (PSLF) model, leveraging two key strategies: a) refining learning error estimation by incorporating the PID controller principles, and b) acquiring second-order information insights through Hessian-vector products. Experimental results on multiple HDI datasets indicate that the proposed PSLF model outperforms four state-of-the-art latent factor models based on advanced optimizers regarding convergence rates and generalization performance.
☆ An Enhanced Batch Query Architecture in Real-time Recommendation CIKM 2024
In industrial recommendation systems on websites and apps, it is essential to recall and predict top-n results relevant to user interests from a content pool of billions within milliseconds. To cope with continuous data growth and improve real-time recommendation performance, we have designed and implemented a high-performance batch query architecture for real-time recommendation systems. Our contributions include optimizing hash structures with a cacheline-aware probing method to enhance coalesced hashing, as well as the implementation of a hybrid storage key-value service built upon it. Our experiments indicate this approach significantly surpasses conventional hash tables in batch query throughput, achieving up to 90% of the query throughput of random memory access when incorporating parallel optimization. The support for NVMe, integrating two-tier storage for hot and cold data, notably reduces resource consumption. Additionally, the system facilitates dynamic updates, automated sharding of attributes and feature embedding tables, and introduces innovative protocols for consistency in batch queries, thereby enhancing the effectiveness of real-time incremental learning updates. This architecture has been deployed and in use in the bilibili recommendation system for over a year, a video content community with hundreds of millions of users, supporting 10x increase in model computation with minimal resource growth, improving outcomes while preserving the system's real-time performance.
comment: 8 pages, 10 figures, CIKM 2024 Applied Research Paper
♻ ☆ Decoding Knowledge Claims: The Evaluation of Scientific Publication Contributions through Semantic Analysis
The surge in scientific publications challenges the use of publication counts as a measure of scientific progress, requiring alternative metrics that emphasize the quality and novelty of scientific contributions rather than sheer quantity. This paper proposes the use of Relaxed Word Mover's Distance (RWMD), a semantic text similarity measure, to evaluate the novelty of scientific papers. We hypothesize that RWMD can more effectively gauge the growth of scientific knowledge. To test such an assumption, we apply RWMD to evaluate seminal papers, with Hirsch's H-Index paper as a primary case study. We compare RWMD results across three groups: 1) H-Index-related papers, 2) scientometric studies, and 3) unrelated papers, aiming to discern redundant literature and hype from genuine innovations. Findings suggest that emphasizing knowledge claims offers a deeper insight into scientific contributions, marking RWMD as a promising alternative method to traditional citation metrics, thus better tracking significant scientific breakthroughs.
Machine Learning
♻ ☆ Adversarial Domain Adaptation for Cross-user Activity Recognition Using Diffusion-based Noise-centred Learning
Human Activity Recognition (HAR) plays a crucial role in various applications such as human-computer interaction and healthcare monitoring. However, challenges persist in HAR models due to the data distribution differences between training and real-world data distributions, particularly evident in cross-user scenarios. This paper introduces a novel framework, termed Diffusion-based Noise-centered Adversarial Learning Domain Adaptation (Diff-Noise-Adv-DA), designed to address these challenges by leveraging generative diffusion modeling and adversarial learning techniques. Traditional HAR models often struggle with the diversity of user behaviors and sensor data distributions. Diff-Noise-Adv-DA innovatively integrates the inherent noise within diffusion models, harnessing its latent information to enhance domain adaptation. Specifically, the framework transforms noise into a critical carrier of activity and domain class information, facilitating robust classification across different user domains. Experimental evaluations demonstrate the effectiveness of Diff-Noise-Adv-DA in improving HAR model performance across different users, surpassing traditional domain adaptation methods. The framework not only mitigates distribution mismatches but also enhances data quality through noise-based denoising techniques.
♻ ☆ CURLing the Dream: Contrastive Representations for World Modeling in Reinforcement Learning
In this work, we present Curled-Dreamer, a novel reinforcement learning algorithm that integrates contrastive learning into the DreamerV3 framework to enhance performance in visual reinforcement learning tasks. By incorporating the contrastive loss from the CURL algorithm and a reconstruction loss from autoencoder, Curled-Dreamer achieves significant improvements in various DeepMind Control Suite tasks. Our extensive experiments demonstrate that Curled-Dreamer consistently outperforms state-of-the-art algorithms, achieving higher mean and median scores across a diverse set of tasks. The results indicate that the proposed approach not only accelerates learning but also enhances the robustness of the learned policies. This work highlights the potential of combining different learning paradigms to achieve superior performance in reinforcement learning applications.
comment: Paper accepted for 24th International Conference on Control, Automation and Systems (ICCAS)
♻ ☆ Online-Score-Aided Federated Learning: Taming the Resource Constraints in Wireless Networks
While FL is a widely popular distributed ML strategy that protects data privacy, time-varying wireless network parameters and heterogeneous system configurations of the wireless device pose significant challenges. Although the limited radio and computational resources of the network and the clients, respectively, are widely acknowledged, two critical yet often ignored aspects are (a) wireless devices can only dedicate a small chunk of their limited storage for the FL task and (b) new training samples may arrive in an online manner in many practical wireless applications. Therefore, we propose a new FL algorithm called OSAFL, specifically designed to learn tasks relevant to wireless applications under these practical considerations. Since it has long been proven that under extreme resource constraints, clients may perform an arbitrary number of local training steps, which may lead to client drift under statistically heterogeneous data distributions, we leverage normalized gradient similarities and exploit weighting clients' updates based on optimized scores that facilitate the convergence rate of the proposed OSAFL algorithm. Our extensive simulation results on two different tasks -- each with three different datasets -- with four popular ML models validate the effectiveness of OSAFL compared to six existing state-of-the-art FL baselines.
comment: Under review for possible publication in IEEE Transactions on Communications
♻ ☆ Kolmogorov-Arnold Network for Online Reinforcement Learning
Kolmogorov-Arnold Networks (KANs) have shown potential as an alternative to Multi-Layer Perceptrons (MLPs) in neural networks, providing universal function approximation with fewer parameters and reduced memory usage. In this paper, we explore the use of KANs as function approximators within the Proximal Policy Optimization (PPO) algorithm. We evaluate this approach by comparing its performance to the original MLP-based PPO using the DeepMind Control Proprio Robotics benchmark. Our results indicate that the KAN-based reinforcement learning algorithm can achieve comparable performance to its MLP-based counterpart, often with fewer parameters. These findings suggest that KANs may offer a more efficient option for reinforcement learning models.
comment: Paper accepted at 24th International Conference on Control, Automation and Systems (ICCAS)
♻ ☆ Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
Diffusion-based generative models' impressive ability to create convincing images has garnered global attention. However, their complex structures and operations often pose challenges for non-experts to grasp. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex structure with explanations of the underlying operations. By comparing image generation of prompt variants, users can discover the impact of keyword changes on image generation. A 56-participant user study demonstrates that Diffusion Explainer offers substantial learning benefits to non-experts. Our tool has been used by over 10,300 users from 124 countries at https://poloclub.github.io/diffusion-explainer/.
comment: 5 pages, 7 figures
♻ ☆ Do Concept Bottleneck Models Respect Localities?
Concept-based methods explain model predictions using human-understandable concepts. These models require accurate concept predictors, yet the faithfulness of existing concept predictors to their underlying concepts is unclear. In this paper, we investigate the faithfulness of Concept Bottleneck Models (CBMs), a popular family of concept-based architectures, by looking at whether they respect "localities" in datasets. Localities involve using only relevant features when predicting a concept's value. When localities are not considered, concepts may be predicted based on spuriously correlated features, degrading performance and robustness. This work examines how CBM predictions change when perturbing model inputs, and reveals that CBMs may not capture localities, even when independent concepts are localised to non-overlapping feature subsets. Our empirical and theoretical results demonstrate that datasets with correlated concepts may lead to accurate but uninterpretable models that fail to learn localities. Overall, we find that CBM interpretability is fragile, as CBMs occasionally rely upon spurious features, necessitating further research into the robustness of concept predictors.
comment: Previous Version Accepted at NeurIPs 23 XAI in Action Workshop
♻ ☆ Discovery of Small Ultra-short-period Planets Orbiting KG Dwarfs in Kepler Survey Using GPU Phase Folding and Deep Learning Detection System
Since the discovery of the first hot Jupiter orbiting a solar-type star, 51 Peg, in 1995, more than 4000 exoplanets have been identified using various observational techniques. The formation process of these sub-Earths remains elusive, and acquiring additional samples is essential for investigating this unique population. In our study, we employ a novel GPU Phase Folding algorithm combined with a Convolutional Neural Network, termed the GPFC method, on Kepler photometry data. This method enhances the transit search speed significantly over the traditional Box-fitting Least Squares method, allowing a complete search of the known KOI photometry data within hours using a commercial GPU card. To date, we have identified five promising sub-Earth short-period candidates: K00446.c, K01821.b, K01522.c, K03404.b, and K04978.b. A closer analysis reveals the following characteristics: K00446.c orbits a K dwarf on a 0.645091-day period. With a radius of $0.461R_\oplus$, it ranks as the second smallest USP discovered to date. K01821.b is a sub-Earth with a radius of $0.648R_\oplus$, orbiting a G dwarf over a 0.91978-day period. It is the second smallest USP among all confirmed USPs orbiting G dwarfs in the NASA Archive. K01522.c has a radius of $0.704 R_\oplus$ and completes an orbit around a Sun-like G dwarf in 0.64672 days; K03404.b, with a radius of $0.738 R_\oplus$, orbits a G dwarf on a 0.68074-day period; and K04978.b, with its planetary radius of $0.912 R_\oplus$, orbits a G dwarf, completing an orbit every 0.94197 days. Three of our finds, K01821.b, K01522.c and K03404.b, rank as the smallest planets among all confirmed USPs orbiting G dwarfs in the Kepler dataset. The discovery of these small exoplanets underscores the promising capability of the GPFC method for searching for small, new transiting exoplanets in photometry data from Kepler, TESS, and future space transit missions.
comment: 24 pages, 40 figures; To be published in the Monthly Notices of the Royal Astronomical Society (MNRAS)
♻ ☆ Convex Hull Prediction for Adaptive Video Streaming by Recurrent Learning
Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step is equivalent to an exhaustive search process over the space of possible encoding parameters, which causes significant overhead in terms of both computation and time expenditure. To reduce this overhead, we propose a deep learning based method of content aware convex hull prediction. We employ a recurrent convolutional network (RCN) to implicitly analyze the spatiotemporal complexity of video shots in order to predict their convex hulls. A two-step transfer learning scheme is adopted to train our proposed RCN-Hull model, which ensures sufficient content diversity to analyze scene complexity, while also making it possible to capture the scene statistics of pristine source videos. Our experimental results reveal that our proposed model yields better approximations of the optimal convex hulls, and offers competitive time savings as compared to existing approaches. On average, the pre-encoding time was reduced by 53.8% by our method, while the average Bjontegaard delta bitrate (BD-rate) of the predicted convex hulls against ground truth was 0.26%, and the mean absolute deviation of the BD-rate distribution was 0.57%.
♻ ☆ Public Transit Arrival Prediction: a Seq2Seq RNN Approach
Arrival/Travel times for public transit exhibit variability on account of factors like seasonality, dwell times at bus stops, traffic signals, travel demand fluctuation etc. The developing world in particular is plagued by additional factors like lack of lane discipline, excess vehicles, diverse modes of transport and so on. This renders the bus arrival time prediction (BATP) to be a challenging problem especially in the developing world. A novel data-driven model based on recurrent neural networks (RNNs) is proposed for BATP (in real-time) in the current work. The model intelligently incorporates both spatial and temporal correlations in a unique (non-linear) fashion distinct from existing approaches. In particular, we propose a Gated Recurrent Unit (GRU) based Encoder-Decoder(ED) OR Seq2Seq RNN model (originally introduced for language translation) for BATP. The geometry of the dynamic real time BATP problem enables a nice fit with the Encoder-Decoder based RNN structure. We feed relevant additional synchronized inputs (from previous trips) at each step of the decoder (a feature classically unexplored in machine translation applications). Further motivated from accurately modelling congestion influences on travel time prediction, we additionally propose to use a bidirectional layer at the decoder (something unexplored in other time-series based ED application contexts). The effectiveness of the proposed algorithms is demonstrated on real field data collected from challenging traffic conditions. Our experiments indicate that the proposed method outperforms diverse existing state-of-art data-driven approaches proposed for the same problem.
♻ ☆ Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing
Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). While prior studies have primarily focused on the post-generation analysis and refinement of outputs, this paper centers on the effectiveness of queries in eliciting accurate responses from LLMs. We present HalluciBot, a model that estimates the query's propensity to hallucinate before generation, without invoking any LLMs during inference. HalluciBot can serve as a proxy reward model for query rewriting, offering a general framework to estimate query quality based on accuracy and consensus. In essence, HalluciBot investigates how poorly constructed queries can lead to erroneous outputs - moreover, by employing query rewriting guided by HalluciBot's empirical estimates, we demonstrate that 95.7% output accuracy can be achieved for Multiple Choice questions. The training procedure for HalluciBot consists of perturbing 369,837 queries n times, employing n+1 independent LLM agents, sampling an output from each query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs, and training an encoder classifier. The idea of perturbation is the outcome of our ablation studies that measures the increase in output diversity (+12.5 agreement spread) by perturbing a query in lexically different but semantically similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1 score, 46.6% in saved computation on hallucinatory queries), rewrite (+30.2% positive class transition from hallucinatory to non-hallucinatory), rank (+50.6% positive class transition from hallucinatory to non-hallucinatory), and route queries to effective pipelines.
♻ ☆ Evaluating Large Language Models for Health-related Queries with Presuppositions ACL 2024
As corporations rush to integrate large language models (LLMs) to their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT 26% and BingChat 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy, and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.
comment: Findings of ACL 2024
♻ ☆ Towards a theory of how the structure of language is acquired by deep neural networks
How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG) -- a tree-like generative model that captures many of the hierarchical structures found in natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables, the longer the range the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a Language Model trained with increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem. We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets. In particular, our conjecture predicts how the scaling law for the test loss behaviour with training set size depends on the length of the context window, which we confirm empirically in Shakespeare's plays and Wikipedia articles.
comment: 9 pages, 4 figures (main)
♻ ☆ Enhanced Federated Optimization: Adaptive Unbiased Client Sampling with Reduced Variance
Federated Learning (FL) is a distributed learning paradigm to train a global model across multiple devices without collecting local data. In FL, a server typically selects a subset of clients for each training round to optimize resource usage. Central to this process is the technique of unbiased client sampling, which ensures a representative selection of clients. Current methods primarily utilize a random sampling procedure which, despite its effectiveness, achieves suboptimal efficiency owing to the loose upper bound caused by the sampling variance. In this work, by adopting an independent sampling procedure, we propose a federated optimization framework focused on adaptive unbiased client sampling, improving the convergence rate via an online variance reduction strategy. In particular, we present the first adaptive client sampler, K-Vib, employing an independent sampling procedure. K-Vib achieves a linear speed-up on the regret bound $\tilde{\mathcal{O}}\big(N^{\frac{1}{3}}T^{\frac{2}{3}}/K^{\frac{4}{3}}\big)$ within a set communication budget $K$. Empirical studies indicate that K-Vib doubles the speed compared to baseline algorithms, demonstrating significant potential in federated optimization.
comment: Under review
♻ ☆ An experimental evaluation of Deep Reinforcement Learning algorithms for HVAC control
Heating, Ventilation, and Air Conditioning (HVAC) systems are a major driver of energy consumption in commercial and residential buildings. Recent studies have shown that Deep Reinforcement Learning (DRL) algorithms can outperform traditional reactive controllers. However, DRL-based solutions are generally designed for ad hoc setups and lack standardization for comparison. To fill this gap, this paper provides a critical and reproducible evaluation, in terms of comfort and energy consumption, of several state-of-the-art DRL algorithms for HVAC control. The study examines the controllers' robustness, adaptability, and trade-off between optimization goals by using the Sinergym framework. The results obtained confirm the potential of DRL algorithms, such as SAC and TD3, in complex scenarios and reveal several challenges related to generalization and incremental learning.
♻ ☆ Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer
Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer which is the first work to leverage the diagonal Hessian to enhance zeroth-order optimizer for fine-tuning LLMs. What's more, HiZOO avoids the expensive memory cost and only increases one forward pass per step. Extensive experiments on various models (350M~66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.
♻ ☆ A rapid approach to urban traffic noise mapping with a generative adversarial network
With rapid urbanisation and the accompanying increase in traffic density, traffic noise has become a major concern in urban planning. However, traditional grid noise mapping methods have limitations in terms of time consumption, software costs, and a lack of parameter integration interfaces. These limitations hinder their ability to meet the need for iterative updates and rapid performance feedback in the early design stages of street-scale urban planning. Herein, we developed a rapid urban traffic noise mapping technique that leverages generative adversarial networks (GANs) as a surrogate model. This approach enables the rapid assessment of urban traffic noise distribution by using urban elements such as roads and buildings as the input. The mean values for the mean squared error (RMSE) and structural similarity index (SSIM) are 0.3024 dB(A) and 0.8528, respectively, for the validation dataset. The trained model is integrated into Grasshopper as a tool, facilitating the rapid generation of traffic noise maps. This integration allows urban designers and planners, even those without expertise in acoustics, to easily anticipate changes in acoustics impacts caused by design in the early design stages.
comment: Accepted by Applied Acoustics
♻ ☆ Active Learning of Discrete-Time Dynamics for Uncertainty-Aware Model Predictive Control
Model-based control requires an accurate model of the system dynamics for precisely and safely controlling the robot in complex and dynamic environments. Moreover, in the presence of variations in the operating conditions, the model should be continuously refined to compensate for dynamics changes. In this paper, we present a self-supervised learning approach that actively models the dynamics of nonlinear robotic systems. We combine offline learning from past experience and online learning from current robot interaction with the unknown environment. These two ingredients enable a highly sample-efficient and adaptive learning process, capable of accurately inferring model dynamics in real-time even in operating regimes that greatly differ from the training distribution. Moreover, we design an uncertainty-aware model predictive controller that is heuristically conditioned to the aleatoric (data) uncertainty of the learned dynamics. This controller actively chooses the optimal control actions that (i) optimize the control performance and (ii) improve the efficiency of online learning sample collection. We demonstrate the effectiveness of our method through a series of challenging real-world experiments using a quadrotor system. Our approach showcases high resilience and generalization capabilities by consistently adapting to unseen flight conditions, while it significantly outperforms classical and adaptive control baselines.
♻ ☆ Primal-dual extrapolation methods for monotone inclusions under local Lipschitz continuity
In this paper we consider a class of monotone inclusion (MI) problems of finding a zero of the sum of two monotone operators, in which one operator is maximal monotone while the other is {\it locally Lipschitz} continuous. We propose primal-dual extrapolation methods to solve them using a point and operator extrapolation technique, whose parameters are chosen by a backtracking line search scheme. The proposed methods enjoy an operation complexity of ${\cal O}(\log \epsilon^{-1})$ and ${\cal O}(\epsilon^{-1}\log \epsilon^{-1})$, measured by the number of fundamental operations consisting only of evaluations of one operator and resolvent of the other operator, for finding an $\varepsilon$-residual solution of strongly and non-strongly MI problems, respectively. The latter complexity significantly improves the previously best operation complexity ${\cal O}(\varepsilon^{-2})$. As a byproduct, complexity results of the primal-dual extrapolation methods are also obtained for finding an $\varepsilon$-KKT or $\varepsilon$-residual solution of convex conic optimization, conic constrained saddle point, and variational inequality problems under {\it local Lipschitz} continuity. We provide preliminary numerical results to demonstrate the performance of the proposed methods.
comment: To appear in Mathematics of Operations Research
♻ ☆ Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations
The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52\% and 35\%, respectively.
comment: 11 pages, 3 tables
♻ ☆ VA-learning as a more efficient alternative to Q-learning ICML 2023
In reinforcement learning, the advantage function is critical for policy improvement, but is often extracted from a learned Q-function. A natural question is: Why not learn the advantage function directly? In this work, we introduce VA-learning, which directly learns advantage function and value function using bootstrapping, without explicit reference to Q-functions. VA-learning learns off-policy and enjoys similar theoretical guarantees as Q-learning. Thanks to the direct learning of advantage function and value function, VA-learning improves the sample efficiency over Q-learning both in tabular implementations and deep RL agents on Atari-57 games. We also identify a close connection between VA-learning and the dueling architecture, which partially explains why a simple architectural change to DQN agents tends to improve performance.
comment: Accepted to ICML 2023 as a conference paper
♻ ☆ Adaptive Split Balancing for Optimal Random Forest
In this paper, we propose a new random forest algorithm that constructs the trees using a novel adaptive split-balancing method. Rather than relying on the widely-used random feature selection, we propose a permutation-based balanced splitting criterion. The adaptive split balancing forest (ASBF), achieves minimax optimality under the Lipschitz class. Its localized version, which fits local regressions at the leaf level, attains the minimax rate under the broad H\"older class $\mathcal{H}^{q,\beta}$ of problems for any $q\in\mathbb{N}$ and $\beta\in(0,1]$. We identify that over-reliance on auxiliary randomness in tree construction may compromise the approximation power of trees, leading to suboptimal results. Conversely, the proposed less random, permutation-based approach demonstrates optimality over a wide range of models. Although random forests are known to perform well empirically, their theoretical convergence rates are slow. Simplified versions that construct trees without data dependence offer faster rates but lack adaptability during tree growth. Our proposed method achieves optimality in simple, smooth scenarios while adaptively learning the tree structure from the data. Additionally, we establish uniform upper bounds and demonstrate that ASBF improves dimensionality dependence in average treatment effect estimation problems. Simulation studies and real-world applications demonstrate our methods' superior performance over existing random forests.
♻ ☆ A Survey for Foundation Models in Autonomous Driving
The advent of foundation models has revolutionized the fields of natural language processing and computer vision, paving the way for their application in autonomous driving (AD). This survey presents a comprehensive review of more than 40 research papers, demonstrating the role of foundation models in enhancing AD. Large language models contribute to planning and simulation in AD, particularly through their proficiency in reasoning, code generation and translation. In parallel, vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking, as well as creating realistic driving scenarios for simulation and testing. Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning, crucial for end-to-end AD. This survey not only provides a structured taxonomy, categorizing foundation models based on their modalities and functionalities within the AD domain but also delves into the methods employed in current research. It identifies the gaps between existing foundation models and cutting-edge AD approaches, thereby charting future research directions and proposing a roadmap for bridging these gaps.
Multimedia
☆ Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification
Multimodal learning involves integrating information from various modalities to enhance learning and comprehension. We compare three modality fusion strategies in person identification and verification by processing two modalities: voice and face. In this paper, a one-dimensional convolutional neural network is employed for x-vector extraction from voice, while the pre-trained VGGFace2 network and transfer learning are utilized for face modality. In addition, gammatonegram is used as speech representation in engagement with the Darknet19 pre-trained network. The proposed systems are evaluated using the K-fold cross-validation technique on the 118 speakers of the test set of the VoxCeleb2 dataset. The comparative evaluations are done for single-modality and three proposed multimodal strategies in equal situations. Results demonstrate that the feature fusion strategy of gammatonegram and facial features achieves the highest performance, with an accuracy of 98.37% in the person identification task. However, concatenating facial features with the x-vector reaches 0.62% for EER in verification tasks.
comment: This paper has been submitted to a conference
☆ Digit Recognition using Multimodal Spiking Neural Networks
Spiking neural networks (SNNs) are the third generation of neural networks that are biologically inspired to process data in a fashion that emulates the exchange of signals in the brain. Within the Computer Vision community SNNs have garnered significant attention due in large part to the availability of event-based sensors that produce a spatially resolved spike train in response to changes in scene radiance. SNNs are used to process event-based data due to their neuromorphic nature. The proposed work examines the neuromorphic advantage of fusing multiple sensory inputs in classification tasks. Specifically we study the performance of a SNN in digit classification by passing in a visual modality branch (Neuromorphic-MNIST [N-MNIST]) and an auditory modality branch (Spiking Heidelberg Digits [SHD]) from datasets that were created using event-based sensors to generate a series of time-dependent events. It is observed that multi-modal SNNs outperform unimodal visual and unimodal auditory SNNs. Furthermore, it is observed that the process of sensory fusion is insensitive to the depth at which the visual and auditory branches are combined. This work achieves a 98.43% accuracy on the combined N-MNIST and SHD dataset using a multimodal SNN that concatenates the visual and auditory branches at a late depth.
comment: 4 pages, 2 figures, submitted to 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing
☆ Multi-scale Multi-instance Visual Sound Localization and Segmentation
Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale visual features to localize sounding objects in each image. Despite their promising performance, they omitted multi-scale visual features of the corresponding image, and they cannot learn discriminative regions compared to ground truths. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We also introduce a novel multi-scale multi-instance transformer to dynamically aggregate multi-scale cross-modal representations for visual sound localization. We conduct extensive experiments on VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks. The results demonstrate that the proposed M2VSL can achieve state-of-the-art performance on sounding object localization and segmentation.
♻ ☆ Palantir: Towards Efficient Super Resolution for Ultra-high-definition Live Streaming
Neural enhancement through super-resolution (SR) deep neural networks (DNNs) opens up new possibilities for ultra-high-definition (UHD) live streaming over existing encoding and networking infrastructure. Yet, the heavy SR DNN inference overhead leads to severe deployment challenges. To reduce the overhead, existing systems propose to apply DNN-based SR only on carefully selected anchor frames while upscaling non-anchor frames via the lightweight reusing-based SR approach. However, frame-level scheduling is coarse-grained and fails to deliver optimal efficiency. In this work, we propose Palantir, the first neural-enhanced UHD live streaming system with fine-grained patch-level scheduling. Two novel techniques are incorporated into Palantir to select the most beneficial anchor patches and support latency-sensitive UHD live streaming applications. Firstly, under the guidance of our pioneering and theoretical analysis, Palantir constructs a directed acyclic graph (DAG) for lightweight yet accurate SR quality estimation under any possible anchor patch set. Secondly, to further optimize the scheduling latency, Palantir improves parallelizability by refactoring the computation subprocedure of the estimation process into a sparse matrix-matrix multiplication operation. The evaluation results suggest that Palantir incurs a negligible scheduling latency accounting for less than 5.7% of the end-to-end latency requirement. When compared to the naive method of applying DNN-based SR on all the frames, Palantir can reduce the SR DNN inference overhead by 20 times (or 60 times) while preserving 54.0-82.6% (or 32.8-64.0%) of the quality gain. When compared to the state-of-the-art real-time frame-level scheduling strategy, Palantir can reduce the SR DNN inference overhead by 80.1% at most (and 38.4% on average) without sacrificing the video quality.
Computation and Language
☆ Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding ECCV'24
While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: https://joslefaure.github.io/assets/html/hermes.html.
comment: Accepted to the EVAL-FoMo Workshop at ECCV'24. Project page: https://joslefaure.github.io/assets/html/hermes.html
☆ SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the taskspecific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks. We share our code in https://github.com/Loreley99/SynthEval_CheckList.
☆ CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models
The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine if LMs can perform post-OCR correction, improve downstream NLP tasks, and the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
comment: 13 pages, 3 figures, currently under peer review
☆ NDP: Next Distribution Prediction as a More Broad Target
Large language models (LLMs) trained on next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm contains several limitations, particularly related to planned task complications and error propagation during inference. In our work, we extend the critique of NTP, highlighting its limitation also due to training with a narrow objective: the prediction of a sub-optimal one-hot distribution. To support this critique, we conducted a pre-experiment treating the output distribution from powerful LLMs as efficient world data compression. By evaluating the similarity between the $n$-gram distribution and the one-hot distribution with LLMs, we observed that the $n$-gram distributions align more closely with the output distribution of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets, enhancing learning without extra online training time. We conducted experiments across translation, general task, language transfer, and medical domain adaptation. Compared to NTP, NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain. This demonstrates the concrete benefits of addressing the target narrowing problem, pointing to a new direction for future work on improving NTP.
comment: 8 pages,5 figures
☆ Assessing Generative Language Models in Classification Tasks: Performance and Self-Evaluation Capabilities in the Environmental and Climate Change Domain
This paper examines the performance of two Large Language Models (LLMs), GPT3.5 and Llama2 and one Small Language Model (SLM) Gemma, across three different classification tasks within the climate change (CC) and environmental domain. Employing BERT-based models as a baseline, we compare their efficacy against these transformer-based models. Additionally, we assess the models' self-evaluation capabilities by analyzing the calibration of verbalized confidence scores in these text classification tasks. Our findings reveal that while BERT-based models generally outperform both the LLMs and SLM, the performance of the large generative models is still noteworthy. Furthermore, our calibration analysis reveals that although Gemma is well-calibrated in initial tasks, it thereafter produces inconsistent results; Llama is reasonably calibrated, and GPT consistently exhibits strong calibration. Through this research, we aim to contribute to the ongoing discussion on the utility and effectiveness of generative LMs in addressing some of the planet's most urgent issues, highlighting their strengths and limitations in the context of ecology and CC.
comment: 11 pages, to be published in NLDB 2024
☆ Impact of ChatGPT on the writing style of condensed matter physicists
We apply a state-of-the-art difference-in-differences approach to estimate the impact of ChatGPT's release on the writing style of condensed matter papers on arXiv. Our analysis reveals a statistically significant improvement in the English quality of abstracts written by non-native English speakers. Importantly, this improvement remains robust even after accounting for other potential factors, confirming that it can be attributed to the release of ChatGPT. This indicates widespread adoption of the tool. Following the release of ChatGPT, there is a significant increase in the use of unique words, while the frequency of rare words decreases. Across language families, the changes in writing style are significant for authors from the Latin and Ural-Altaic groups, but not for those from the Germanic or other Indo-European groups.
comment: 9 pages, 1 figure, 7 tables
☆ Modularity in Transformers: Investigating Neuron Separability & Specialization
Transformer models are increasingly prevalent in various applications, yet our understanding of their internal workings remains limited. This paper investigates the modularity and task specialization of neurons within transformer architectures, focusing on both vision (ViT) and language (Mistral 7B) models. Using a combination of selective pruning and MoEfication clustering techniques, we analyze the overlap and specialization of neurons across different tasks and data subsets. Our findings reveal evidence of task-specific neuron clusters, with varying degrees of overlap between related tasks. We observe that neuron importance patterns persist to some extent even in randomly initialized models, suggesting an inherent structure that training refines. Additionally, we find that neuron clusters identified through MoEfication correspond more strongly to task-specific neurons in earlier and later layers of the models. This work contributes to a more nuanced understanding of transformer internals and offers insights into potential avenues for improving model interpretability and efficiency.
comment: 11 pages, 6 figures
☆ Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Though there are many interpretability methods, many look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate the effectiveness in language models and vision transformers through various methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term 'peak ablation'. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods, with resampling usually causing the most significant performance deterioration. We make our code available at https://github.com/nickypro/investigating-ablation.
comment: 9 pages, 2 figures, XAI World Conference 2024 Late-Breaking Work
☆ Bridging Domain Knowledge and Process Discovery Using Large Language Models
Discovering good process models is essential for different process analysis tasks such as conformance checking and process improvements. Automated process discovery methods often overlook valuable domain knowledge. This knowledge, including insights from domain experts and detailed process documentation, remains largely untapped during process discovery. This paper leverages Large Language Models (LLMs) to integrate such knowledge directly into process discovery. We use rules derived from LLMs to guide model construction, ensuring alignment with both domain knowledge and actual process executions. By integrating LLMs, we create a bridge between process knowledge expressed in natural language and the discovery of robust process models, advancing process discovery methodologies significantly. To showcase the usability of our framework, we conducted a case study with the UWV employee insurance agency, demonstrating its practical benefits and effectiveness.
comment: This paper is accepted at the AI4BPM 2024 workshop and to be published in their proceedings
☆ Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation
Machine translations are found to be lexically poorer than human translations. The loss of lexical diversity through MT poses an issue in the automatic translation of literature, where it matters not only what is written, but also how it is written. Current methods for increasing lexical diversity in MT are rigid. Yet, as we demonstrate, the degree of lexical diversity can vary considerably across different novels. Thus, rather than aiming for the rigid increase of lexical diversity, we reframe the task as recovering what is lost in the machine translation process. We propose a novel approach that consists of reranking translation candidates with a classifier that distinguishes between original and translated text. We evaluate our approach on 31 English-to-Dutch book translations, and find that, for certain books, our approach retrieves lexical diversity scores that are close to human translation.
comment: Accepted to EAMT 2024
☆ Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts
We present a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE) from trained models. The toolkit can be used for creating a mixture from models or from adapters. We perform extensive tests and offer guidance on defining the architecture of the resulting MOE using the toolkit. A public repository is available.
☆ Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study
Electronic Health Records are large repositories of valuable clinical data, with a significant portion stored in unstructured text format. This textual data includes clinical events (e.g., disorders, symptoms, findings, medications and procedures) in context that if extracted accurately at scale can unlock valuable downstream applications such as disease prediction. Using an existing Named Entity Recognition and Linking methodology, MedCAT, these identified concepts need to be further classified (contextualised) for their relevance to the patient, and their temporal and negated status for example, to be useful downstream. This study performs a comparative analysis of various natural language models for medical text classification. Extensive experimentation reveals the effectiveness of transformer-based language models, particularly BERT. When combined with class imbalance mitigation techniques, BERT outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to 16% for recall of the minority classes. The method has been implemented as part of CogStack/MedCAT framework and made available to the community for further research.
☆ Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: https://x-codec-audio.github.io Code: https://github.com/zhenye234/xcodec)
☆ MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts, necessitating query rewriting to better describe user's information needs. However, traditional context-based rewriting has minimal enhancement on downstream generation tasks due to the lengthy process from query rewriting to response generation. Some researchers try to utilize reinforcement learning with generation feedback to assist the rewriter, but these sparse rewards provide little guidance in most cases, leading to unstable training and generation results. We find that user's needs are also reflected in the gold document, retrieved documents and ground truth. Therefore, by feeding back these multi-aspect dense rewards to query rewriting, more stable and satisfactory responses can be achieved. In this paper, we propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Specifically, we first use manual data to train a T5 model for the rewriter initialization. Next, we design three metrics as reinforcement learning feedback: the similarity between the rewritten query and the gold document, the ranking metrics, and ROUGE between the generation and the ground truth. Inspired by RLAIF, we train three kinds of reward models for the above metrics to achieve more efficient training. Finally, we combine the scores of these reward models as feedback, and use PPO algorithm to explore the optimal query rewriting strategy. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.
☆ Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning
Teaching new information to pre-trained large language models (PLM) is a crucial but challenging task. Model adaptation techniques, such as fine-tuning and parameter-efficient training have been shown to store new facts at a slow rate; continual learning is an option but is costly and prone to catastrophic forgetting. This work studies and quantifies how PLM may learn and remember new world knowledge facts that do not occur in their pre-training corpus, which only contains world knowledge up to a certain date. To that purpose, we first propose Novel-WD, a new dataset consisting of sentences containing novel facts extracted from recent Wikidata updates, along with two evaluation tasks in the form of causal language modeling and multiple choice questions (MCQ). We make this dataset freely available to the community, and release a procedure to later build new versions of similar datasets with up-to-date information. We also explore the use of prefix-tuning for novel information learning, and analyze how much information can be stored within a given prefix. We show that a single fact can reliably be encoded within a single prefix, and that the prefix capacity increases with its length and with the base model size.
☆ From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs
Training emotion recognition models has relied heavily on human annotated data, which present diversity, quality, and cost challenges. In this paper, we explore the potential of Large Language Models (LLMs), specifically GPT4, in automating or assisting emotion annotation. We compare GPT4 with supervised models and or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training. We find that common metrics that use aggregated human annotations as ground truth can underestimate the performance, of GPT-4 and our human evaluation experiment reveals a consistent preference for GPT-4 annotations over humans across multiple datasets and evaluators. Further, we investigate the impact of using GPT-4 as an annotation filtering process to improve model training. Together, our findings highlight the great potential of LLMs in emotion annotation tasks and underscore the need for refined evaluation methodologies.
comment: to be published in Interspeech 2024
☆ InkubaLM: A small language model for low-resource African languages
High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation, question-answering, AfriMMLU, and the AfriXnli task. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available \footnote{\url{https://huggingface.co/lelapa}} to encourage research and development on low-resource languages.
☆ Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling
Self-Consistency (SC) is a widely used method to mitigate hallucinations in Large Language Models (LLMs) by sampling the LLM multiple times and outputting the most frequent solution. Despite its benefits, SC results in significant computational costs proportional to the number of samples generated. Previous early-stopping approaches, such as Early Stopping Self Consistency and Adaptive Consistency, have aimed to reduce these costs by considering output consistency, but they do not analyze the quality of the reasoning paths (RPs) themselves. To address this issue, we propose Reasoning-Aware Self-Consistency (RASC), an innovative early-stopping framework that dynamically adjusts the number of sample generations by considering both the output answer and the RPs from Chain of Thought (CoT) prompting. RASC assigns confidence scores sequentially to the generated samples, stops when certain criteria are met, and then employs weighted majority voting to optimize sample usage and enhance answer reliability. We comprehensively test RASC with multiple LLMs across varied QA datasets. RASC outperformed existing methods and significantly reduces sample usage by an average of 80% while maintaining or improving accuracy up to 5% compared to the original SC
☆ Tool-Assisted Agent on SQL Inspection and Refinement in Real-World Scenarios
Recent Text-to-SQL methods leverage large language models (LLMs) by incorporating feedback from the database management system. While these methods effectively address execution errors in SQL queries, they struggle with database mismatches -- errors that do not trigger execution exceptions. Database mismatches include issues such as condition mismatches and stricter constraint mismatches, both of which are more prevalent in real-world scenarios. To address these challenges, we propose a tool-assisted agent framework for SQL inspection and refinement, equipping the LLM-based agent with two specialized tools: a retriever and a detector, designed to diagnose and correct SQL queries with database mismatches. These tools enhance the capability of LLMs to handle real-world queries more effectively. We also introduce Spider-Mismatch, a new dataset specifically constructed to reflect the condition mismatch problems encountered in real-world scenarios. Experimental results demonstrate that our method achieves the highest performance on the averaged results of the Spider and Spider-Realistic datasets in few-shot settings, and it significantly outperforms baseline methods on the more realistic dataset, Spider-Mismatch.
comment: work in progress
☆ MemLong: Memory-Augmented Retrieval for Long Text Modeling
Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation, a method designed to enhance the capabilities of long-context language modeling by utilizing an external retriever for historical information retrieval. MemLong combines a non-differentiable ``ret-mem'' module with a partially trainable decoder-only language model and introduces a fine-grained, controllable retrieval attention mechanism that leverages semantic-level relevant chunks. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. More importantly, MemLong can extend the context length on a single 3090 GPU from 4k up to 80k. Our code is available at https://github.com/Bui1dMySea/MemLong
☆ UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches
Large language models (LLMs) have shown remarkable capabilities in generating user summaries from a long list of raw user activity data. These summaries capture essential user information such as preferences and interests, and therefore are invaluable for LLM-based personalization applications, such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and human evaluation which is often costly and time-consuming. To address these challenges, we introduce \UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. This framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp and Amazon Review). (2) A novel robust summarization method that leverages time-hierarchical summarizer and self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.
♻ ☆ Evaluating Named Entity Recognition: A comparative analysis of mono- and multilingual transformer models on a novel Brazilian corporate earnings call transcripts dataset
Since 2018, when the Transformer architecture was introduced, Natural Language Processing has gained significant momentum with pre-trained Transformer-based models that can be fine-tuned for various tasks. Most models are pre-trained on large English corpora, making them less applicable to other languages, such as Brazilian Portuguese. In our research, we identified two models pre-trained in Brazilian Portuguese (BERTimbau and PTT5) and two multilingual models (mBERT and mT5). BERTimbau and mBERT use only the Encoder module, while PTT5 and mT5 use both the Encoder and Decoder. Our study aimed to evaluate their performance on a financial Named Entity Recognition (NER) task and determine the computational requirements for fine-tuning and inference. To this end, we developed the Brazilian Financial NER (BraFiNER) dataset, comprising sentences from Brazilian banks' earnings calls transcripts annotated using a weakly supervised approach. Additionally, we introduced a novel approach that reframes the token classification task as a text generation problem. After fine-tuning the models, we evaluated them using performance and error metrics. Our findings reveal that BERT-based models consistently outperform T5-based models. While the multilingual models exhibit comparable macro F1-scores, BERTimbau demonstrates superior performance over PTT5. In terms of error metrics, BERTimbau outperforms the other models. We also observed that PTT5 and mT5 generated sentences with changes in monetary and percentage values, highlighting the importance of accuracy and consistency in the financial domain. Our findings provide insights into the differing performance of BERT- and T5-based models for the NER task.
♻ ☆ Exploring Group and Symmetry Principles in Large Language Models
Large Language Models (LLMs) have demonstrated impressive performance across a wide range of applications; however, assessing their reasoning capabilities remains a significant challenge. In this paper, we introduce a framework grounded in group and symmetry principles, which have played a crucial role in fields such as physics and mathematics, and offer another way to evaluate their capabilities. While the proposed framework is general, to showcase the benefits of employing these properties, we focus on arithmetic reasoning and investigate the performance of these models on four group properties: closure, identity, inverse, and associativity. Our findings reveal that LLMs studied in this work struggle to preserve group properties across different test regimes. In the closure test, we observe biases towards specific outputs and an abrupt degradation in their performance from 100% to 0% after a specific sequence length. They also perform poorly in the identity test, which represents adding irrelevant information in the context, and show sensitivity when subjected to inverse test, which examines the robustness of the model with respect to negation. In addition, we demonstrate that breaking down problems into smaller steps helps LLMs in the associativity test that we have conducted. To support these tests we have developed a synthetic dataset which will be released.
♻ ☆ Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset
Hoaxes are a recognised form of disinformation created deliberately, with potential serious implications in the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they often are written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. In this paper, We report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (full article vs the article's definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible, and complement our analysis with a study on the differences in distributions in edit histories, and find that looking at this feature yields better classification results than context.
♻ ☆ DualKanbaFormer: Kolmogorov-Arnold Networks and State Space Model Transformer for Multimodal Aspect-based Sentiment Analysis
Multimodal aspect-based sentiment analysis (MABSA) enhances sentiment detection by combining text with other data types like images. However, despite setting significant benchmarks, attention mechanisms exhibit limitations in efficiently modelling long-range dependencies between aspect and opinion targets within the text. They also face challenges in capturing global-context dependencies for visual representations. To this end, we propose Kolmogorov-Arnold Networks (KANs) and Selective State Space model (Mamba) transformer (DualKanbaFormer), a novel architecture to address the above issues. We leverage the power of Mamba to capture global context dependencies, Multi-head Attention (MHA) to capture local context dependencies, and KANs to capture non-linear modelling patterns for both textual representations (textual KanbaFormer) and visual representations (visual KanbaFormer). Furthermore, we fuse the textual KanbaFormer and visual KanbaFomer with a gated fusion layer to capture the inter-modality dynamics. According to extensive experimental results, our model outperforms some state-of-the-art (SOTA) studies on two public datasets.
comment: 10 pages, 2 figures, and 3 tables
♻ ☆ Question-Based Retrieval using Atomic Units for Enterprise RAG
Enterprise retrieval augmented generation (RAG) offers a highly flexible framework for combining powerful large language models (LLMs) with internal, possibly temporally changing, documents. In RAG, documents are first chunked. Relevant chunks are then retrieved for a user query, which are passed as context to a synthesizer LLM to generate the query response. However, the retrieval step can limit performance, as incorrect chunks can lead the synthesizer LLM to generate a false response. This work applies a zero-shot adaptation of standard dense retrieval steps for more accurate chunk recall. Specifically, a chunk is first decomposed into atomic statements. A set of synthetic questions are then generated on these atoms (with the chunk as the context). Dense retrieval involves finding the closest set of synthetic questions, and associated chunks, to the user query. It is found that retrieval with the atoms leads to higher recall than retrieval with chunks. Further performance gain is observed with retrieval using the synthetic questions generated over the atoms. Higher recall at the retrieval step enables higher performance of the enterprise LLM using the RAG pipeline.
comment: 14 pages, 5 figures, 5 tables
♻ ☆ Beyond One-Size-Fits-All: Multi-Domain, Multi-Task Framework for Embedding Model Selection
This position paper proposes a systematic approach towards developing a framework to help select the most effective embedding models for natural language processing (NLP) tasks, addressing the challenge posed by the proliferation of both proprietary and open-source encoder models.
comment: It was an initial idea - we plan to work on a detailed version
♻ ☆ Docling Technical Report
This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.
♻ ☆ An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought SC
Since the launch of ChatGPT at the end of 2022, generative dialogue models represented by ChatGPT have quickly become essential tools in daily life. As user expectations increase, enhancing the capability of generative dialogue models to solve complex problems has become a focal point of current research. This paper delves into the effectiveness of the RAFT (Retrieval Augmented Fine-Tuning) method in improving the performance of Generative dialogue models. RAFT combines chain-of-thought with model supervised fine-tuning (SFT) and retrieval augmented generation (RAG), which significantly enhanced the model's information extraction and logical reasoning abilities. We evaluated the RAFT method across multiple datasets and analysed its performance in various reasoning tasks, including long-form QA and short-form QA tasks, tasks in both Chinese and English, and supportive and comparison reasoning tasks. Notably, it addresses the gaps in previous research regarding long-form QA tasks and Chinese datasets. Moreover, we also evaluate the benefit of the chain-of-thought (CoT) in the RAFT method. This work offers valuable insights for studies focused on enhancing the performance of generative dialogue models.
comment: Accepted by ISCSLP 2024
♻ ☆ Language models align with human judgments on key grammatical constructions
Do large language models (LLMs) make human-like linguistic generalizations? Dentella et al. (2023) ("DGL") prompt several LLMs ("Is the following sentence grammatically correct in English?") to elicit grammaticality judgments of 80 English sentences, concluding that LLMs demonstrate a "yes-response bias" and a "failure to distinguish grammatical from ungrammatical sentences". We re-evaluate LLM performance using well-established practices and find that DGL's data in fact provide evidence for just how well LLMs capture human behaviors. Models not only achieve high accuracy overall, but also capture fine-grained variation in human linguistic judgments.
comment: Published in PNAS at https://www.pnas.org/doi/10.1073/pnas.2400917121 as response to Dentella et al. (2023)
♻ ☆ Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer ECAI 2024
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning, based on the principle of divide-and-conquer to maximize model capacity without significant additional computational cost. Even in the era of large-scale language models (LLMs), MoE continues to play a crucial role, as some researchers have indicated that GPT-4 adopts the MoE structure to ensure diverse inference results. However, MoE is susceptible to performance degeneracy, particularly evident in the issues of imbalance and homogeneous representation among experts. While previous studies have extensively addressed the problem of imbalance, the challenge of homogeneous representation remains unresolved. In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity, leading to frustratingly high similarities in their representations (up to 99\% in a well-performed MoE model). This problem restricts the expressive power of the MoE and, we argue, contradicts its original intention. To tackle this issue, we propose a straightforward yet highly effective solution: OMoE, an orthogonal expert optimizer. Additionally, we introduce an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts. Our algorithm facilitates MoE training in two key ways: firstly, it explicitly enhances representation diversity, and secondly, it implicitly fosters interaction between experts during orthogonal weights computation. Through extensive experiments, we demonstrate that our proposed optimization algorithm significantly improves the performance of fine-tuning the MoE model on the GLUE benchmark, SuperGLUE benchmark, question-answering task, and name entity recognition tasks.
comment: ECAI 2024
♻ ☆ EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles CIKM 2024
This work introduces EUvsDisinfo, a multilingual dataset of disinformation articles originating from pro-Kremlin outlets, along with trustworthy articles from credible / less biased sources. It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project. Our dataset is the largest to-date resource in terms of the overall number of articles and distinct languages. It also provides the largest topical and temporal coverage. Using this dataset, we investigate the dissemination of pro-Kremlin disinformation across different languages, uncovering language-specific patterns targeting certain disinformation topics. We further analyse the evolution of topic distribution over an eight-year period, noting a significant surge in disinformation content before the full-scale invasion of Ukraine in 2022. Lastly, we demonstrate the dataset's applicability in training models to effectively distinguish between disinformation and trustworthy content in multilingual settings.
comment: Published at CIKM 2024
♻ ☆ Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.
♻ ☆ Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs), offering unprecedented computational efficiency. However, these architectures grapple with challenges of token distribution imbalance and expert homogenization, impeding optimal semantic generalization. We introduce a novel framework that redefines MoE routing through affinity-driven active selection. The innovations for the framework encompass: (1) A rigorous formulation of expert-token affinity metrics. (2) An adaptive bidirectional selection mechanism leveraging resonance between experts and tokens. (3) Theoretical derivation and experimental evidence of reduced expert capacity bounds under dynamic token distribution evolution. It is also integrated with orthogonal feature extraction module and an optimized loss function for expert localization. Our theoretical analysis demonstrates that this approach mitigates expert homogenization while enabling substantial capacity boundary reduction. Experimental validation corroborates these findings: it achieves a 40% reduction in token processed by each expert without compromising model convergence or efficacy. When coupled with communication optimizations, the training efficiency improvements of 5.4% to 46.6% can be observed. After supervised fine-tuning, it exhibits performance gains of 9.7% to 14.1% across GDAD, C-Eval, and TeleQnA benchmarks.
♻ ☆ TaSL: Task Skill Localization and Consolidation for Language Model Continual Learning ACL 2024
Language model continual learning (CL) has recently attracted significant interest for its ability to adapt large language models (LLMs) to dynamic real-world scenarios without retraining. A major challenge in this domain is catastrophic forgetting, where models lose previously acquired knowledge upon learning new tasks. Existing approaches commonly utilize multiple parameter-efficient fine-tuning (PEFT) blocks to acquire task-specific knowledge, yet these methods are inefficient and fail to leverage potential knowledge transfer across tasks. In this paper, we introduce a novel CL framework for language models, named Task Skill Localization and Consolidation (TaSL), which boosts knowledge transfer without depending on memory replay. TaSL initially segregates the model into 'skill units' based on parameter dependencies, allowing for more precise control. Subsequently, it employs a novel group-wise skill localization technique to ascertain the importance distribution of skill units for a new task. By comparing this importance distribution with those from previous tasks, we implement a fine-grained skill consolidation strategy that retains task-specific knowledge, thereby preventing forgetting, and updates task-shared knowledge, which facilitates bi-directional knowledge transfer. As a result, TaSL achieves an optimal balance between retaining prior knowledge and excelling in new tasks. TaSL also demonstrates strong generalizability, making it suitable for various base models and adaptable to PEFT methods like LoRA. Furthermore, it offers notable extensibility, supporting enhancements through integration with memory replay techniques. Comprehensive experiments conducted on two CL benchmarks, involving models ranging from 220M to 7B parameters, affirm the effectiveness of TaSL and its variants across different settings.
comment: Extension of ACL 2024 paper titled: Continual Dialog State Tracking via Task Skill Localization and Consolidation
♻ ☆ ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
Recent work shows Large Language Models (LLMs) struggle to understand natural language constraints for various text generation tasks in zero- and few-shot settings. While, in the code domain, there is wide usage of constraints in code format to maintain the integrity of code written in Domain-Specific Languages (DSLs) like JSON and YAML which are widely used for system-level programming tasks in enterprises. Given that LLMs are increasingly used for system-level code tasks, evaluating if they can comprehend these code constraints is crucial. However, no work has been done to evaluate their controllability over code constraints. Hence, we introduce ConCodeEval, a first-of-its-kind benchmark having two novel tasks for code constraints across five representations. Our findings suggest that language models struggle with code constraints. Code languages that perform excellently for normal code tasks do not perform well when the same languages represent fine-grained constraints.
♻ ☆ Contextualized Automatic Speech Recognition with Dynamic Vocabulary
Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary. This naive sequence decomposition produces unnatural token patterns, significantly lowering their occurrence probability. More advanced techniques address this problem by expanding the vocabulary with additional modules, including the external language model shallow fusion or rescoring. However, they result in increasing the workload due to the additional modules. This paper proposes a dynamic vocabulary where bias tokens can be added during inference. Each entry in a bias list is represented as a single token, unlike a sequence of existing subword tokens. This approach eliminates the need to learn subword dependencies within the bias phrases. This method is easily applied to various architectures because it only expands the embedding and output layers in common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the bias phrase WER on English and Japanese datasets by 3.1 -- 4.9 points compared with the conventional DB method.
♻ ☆ Causal-Guided Active Learning for Debiasing Large Language Models ACL 2024
Although achieving promising performance, recent analyses show that current generative large language models (LLMs) may still capture dataset biases and utilize them for generation, leading to poor generalizability and harmfulness of LLMs. However, due to the diversity of dataset biases and the over-optimization problem, previous prior-knowledge-based debiasing methods and fine-tuning-based debiasing methods may not be suitable for current LLMs. To address this issue, we explore combining active learning with the causal mechanisms and propose a casual-guided active learning (CAL) framework, which utilizes LLMs itself to automatically and autonomously identify informative biased samples and induce the bias patterns. Then a cost-effective and efficient in-context learning based method is employed to prevent LLMs from utilizing dataset biases during generation. Experimental results show that CAL can effectively recognize typical biased instances and induce various bias patterns for debiasing LLMs.
comment: Accepted as ACL 2024 main conference & Rewared as Outstanding Paper
♻ ☆ Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
In this paper, we present Cross Language Agent -- Simultaneous Interpretation, CLASI, a high-quality and human-like Simultaneous Speech Translation (SiST) System. Inspired by professional human interpreters, we utilize a novel data-driven read-write strategy to balance the translation quality and latency. To address the challenge of translating in-domain terminologies, CLASI employs a multi-modal retrieving module to obtain relevant information to augment the translation. Supported by LLMs, our approach can generate error-tolerated translation by considering the input audio, historical context, and retrieved information. Experimental results show that our system outperforms other systems by significant margins. Aligned with professional human interpreters, we evaluate CLASI with a better human evaluation metric, valid information proportion (VIP), which measures the amount of information that can be successfully conveyed to the listeners. In the real-world scenarios, where the speeches are often disfluent, informal, and unclear, CLASI achieves VIP of 81.3% and 78.0% for Chinese-to-English and English-to-Chinese translation directions, respectively. In contrast, state-of-the-art commercial or open-source systems only achieve 35.4% and 41.6%. On the extremely hard dataset, where other systems achieve under 13% VIP, CLASI can still achieve 70% VIP.
comment: Authors are listed in alphabetical order by last name. Demonstrations and human-annotated test sets are available at https://byteresearchcla.github.io/clasi
♻ ☆ SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks.cIn this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
♻ ☆ AgentsCourt: Building Judicial Decision-Making Agents with Court Debate Simulation and Legal Knowledge Augmentation ACL
With the development of deep learning, natural language processing technology has effectively improved the efficiency of various aspects of the traditional judicial industry. However, most current efforts focus on tasks within individual judicial stages, making it difficult to handle complex tasks that span multiple stages. As the autonomous agents powered by large language models are becoming increasingly smart and able to make complex decisions in real-world settings, offering new insights for judicial intelligence. In this paper, (1) we propose a novel multi-agent framework, AgentsCourt, for judicial decision-making. Our framework follows the classic court trial process, consisting of court debate simulation, legal resources retrieval and decision-making refinement to simulate the decision-making of judge. (2) we introduce SimuCourt, a judicial benchmark that encompasses 420 Chinese judgment documents, spanning the three most common types of judicial cases. Furthermore, to support this task, we construct a large-scale legal knowledge base, Legal-KB, with multi-resource legal knowledge. (3) Extensive experiments show that our framework outperforms the existing advanced methods in various aspects, especially in generating legal articles, where our model achieves significant improvements of 8.6% and 9.1% F1 score in the first and second instance settings, respectively.
comment: This paper was first submitted to ACL ARR 2024 April (Under review)
♻ ☆ Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying "red cube" by reasoning over the constituents "red" and "cube". In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating "cube behind sphere" from "sphere behind cube"). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets - single-object, two-object, and relational - designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.
comment: Lewis and Nayak contributed equally
♻ ☆ Measuring Dimensions of Self-Presentation in Twitter Bios and their Links to Misinformation Sharing
Social media platforms provide users with a profile description field, commonly known as a ``bio," where they can present themselves to the world. A growing literature shows that text in these bios can improve our understanding of online self-presentation and behavior, but existing work relies exclusively on keyword-based approaches to do so. We here propose and evaluate a suite of \hl{simple, effective, and theoretically motivated} approaches to embed bios in spaces that capture salient dimensions of social meaning, such as age and partisanship. We \hl{evaluate our methods on four tasks, showing that the strongest one out-performs several practical baselines.} We then show the utility of our method in helping understand associations between self-presentation and the sharing of URLs from low-quality news sites on Twitter\hl{, with a particular focus on explore the interactions between age and partisanship, and exploring the effects of self-presentations of religiosity}. Our work provides new tools to help computational social scientists make use of information in bios, and provides new insights into how misinformation sharing may be perceived on Twitter.
♻ ☆ Token-level Direct Preference Optimization
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
♻ ☆ Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model's language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method "Any Model Can Talk". We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.
comment: Technical report, work in progress. Demo and code: https://github.com/gpt-omni/mini-omni
♻ ☆ Advancing Chinese biomedical text mining with community challenges
Objective: This study aims to review the recent advances in community challenges for biomedical text mining in China. Methods: We collected information of evaluation tasks released in community challenges of biomedical text mining, including task description, dataset description, data source, task type and related links. A systematic summary and comparative analysis were conducted on various biomedical natural language processing tasks, such as named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. Results: We identified 39 evaluation tasks from 6 community challenges that spanned from 2017 to 2023. Our analysis revealed the diverse range of evaluation task types and data sources in biomedical text mining. We explored the potential clinical applications of these community challenge tasks from a translational biomedical informatics perspective. We compared with their English counterparts, and discussed the contributions, limitations, lessons and guidelines of these community challenges, while highlighting future directions in the era of large language models. Conclusion: Community challenge evaluation competitions have played a crucial role in promoting technology innovation and fostering interdisciplinary collaboration in the field of biomedical text mining. These challenges provide valuable platforms for researchers to develop state-of-the-art solutions.
♻ ☆ Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
Serving large language models (LLMs) in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (eg. TTFT, TBT, Normalised Latency and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance crucial for real-time applications such as chat and translation. In this paper, we first identify the pitfalls of current performance metrics in evaluating LLM inference systems. We then propose Etalon, a comprehensive performance evaluation framework that includes fluidity-index -- a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience. Finally, we evaluate various existing open-source platforms and model-as-a-service offerings using Etalon, discussing their strengths and weaknesses. Etalon is available at https://github.com/project-etalon/etalon.
♻ ☆ Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment
Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.
♻ ☆ Improving Relational Database Interactions with Large Language Models: Column Descriptions and Their Impact on Text-to-SQL Performance
Relational databases often suffer from uninformative descriptors of table contents, such as ambiguous columns and hard-to-interpret values, impacting both human users and Text-to-SQL models. This paper explores the use of large language models (LLMs) to generate informative column descriptions as a semantic layer for relational databases. Using the BIRD-Bench development set, we created ColSQL, a dataset with gold-standard column descriptions generated and refined by LLMs and human annotators. We evaluated several instruction-tuned models, finding that GPT-4o and Command R+ excelled in generating high-quality descriptions. Additionally, we applied an LLM-as-a-judge to evaluate model performance. Although this method does not align well with human evaluations, we included it to explore its potential and to identify areas for improvement. More work is needed to improve the reliability of automatic evaluations for this task. We also find that detailed column descriptions significantly improve Text-to-SQL execution accuracy, especially when columns are uninformative. This study establishes LLMs as effective tools for generating detailed metadata, enhancing the usability of relational databases.
♻ ☆ Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings INTERSPEECH 2024
We release Rasa, the first multilingual expressive TTS dataset for any Indian language, which contains 10 hours of neutral speech and 1-3 hours of expressive speech for each of the 6 Ekman emotions covering 3 languages: Assamese, Bengali, & Tamil. Our ablation studies reveal that just 1 hour of neutral and 30 minutes of expressive data can yield a Fair system as indicated by MUSHRA scores. Increasing neutral data to 10 hours, with minimal expressive data, significantly enhances expressiveness. This offers a practical recipe for resource-constrained languages, prioritizing easily obtainable neutral data alongside smaller amounts of expressive data. We show the importance of syllabically balanced data and pooling emotions to enhance expressiveness. We also highlight challenges in generating specific emotions, e.g., fear and surprise.
comment: Accepted at INTERSPEECH 2024. First two authors listed contributed equally
♻ ☆ Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with a 7B, 14B models trained for 4.8T tokens, called phi-3-small, phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.
comment: 24 pages
♻ ☆ The optimal placement of the head in the noun phrase. The case of demonstrative, numeral, adjective and noun
The word order of a sentence is shaped by multiple principles. The principle of syntactic dependency distance minimization is in conflict with the principle of surprisal minimization (or predictability maximization) in single head syntactic dependency structures: while the former predicts that the head should be placed at the center of the linear arrangement, the latter predicts that the head should be placed at one of the ends (either first or last). A critical question is when surprisal minimization (or predictability maximization) should surpass syntactic dependency distance minimization. In the context of single head structures, it has been predicted that this is more likely to happen when two conditions are met, i.e. (a) fewer words are involved and (b) words are shorter. Here we test the prediction on the noun phrase when it is composed of a demonstrative, a numeral, an adjective and a noun. We find that, across preferred orders in languages, the noun tends to be placed at one of the ends, confirming the theoretical prediction. We also show evidence of anti locality effects: syntactic dependency distances in preferred orders are longer than expected by chance.
comment: typos corrected
♻ ☆ Training Language Models to Generate Text with Citations via Fine-grained Rewards ACL 2024
While recent Large Language Models (LLMs) have proven useful in answering user queries, they are prone to hallucination, and their responses often lack credibility due to missing references to reliable sources. An intuitive solution to these issues would be to include in-text citations referring to external documents as evidence. While previous works have directly prompted LLMs to generate in-text citations, their performances are far from satisfactory, especially when it comes to smaller LLMs. In this work, we propose an effective training framework using fine-grained rewards to teach LLMs to generate highly supportive and relevant citations, while ensuring the correctness of their responses. We also conduct a systematic analysis of applying these fine-grained rewards to common LLM training strategies, demonstrating its advantage over conventional practices. We conduct extensive experiments on Question Answering (QA) datasets taken from the ALCE benchmark and validate the model's generalizability using EXPERTQA. On LLaMA-2-7B, the incorporation of fine-grained rewards achieves the best performance among the baselines, even surpassing that of GPT-3.5-turbo.
comment: Accepted by ACL 2024
♻ ☆ Conversation Disentanglement with Bi-Level Contrastive Learning EMNLP 2022
Conversation disentanglement aims to group utterances into detached sessions, which is a fundamental task in processing multi-party conversations. Existing methods have two main drawbacks. First, they overemphasize pairwise utterance relations but pay inadequate attention to the utterance-to-context relation modeling. Second, huge amount of human annotated data is required for training, which is expensive to obtain in practice. To address these issues, we propose a general disentangle model based on bi-level contrastive learning. It brings closer utterances in the same session while encourages each utterance to be near its clustered session prototypes in the representation space. Unlike existing approaches, our disentangle model works in both supervised setting with labeled data and unsupervised setting when no such data is available. The proposed method achieves new state-of-the-art performance on both settings across several public datasets.
comment: Accepted by EMNLP 2022 Findings
♻ ☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that the reducing the embedding dimensionality from 128 to 64 has insignificant impact on the model's retrieval performance and cut storage requirements by up to 50%. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks,
Computer Vision and Pattern Recognition
☆ Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding ECCV'24
While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: https://joslefaure.github.io/assets/html/hermes.html.
comment: Accepted to the EVAL-FoMo Workshop at ECCV'24. Project page: https://joslefaure.github.io/assets/html/hermes.html
☆ DARES: Depth Anything in Robotic Endoscopic Surgery with Self-supervised Vector-LoRA of the Foundation Model
Robotic-assisted surgery (RAS) relies on accurate depth estimation for 3D reconstruction and visualization. While foundation models like Depth Anything Models (DAM) show promise, directly applying them to surgery often yields suboptimal results. Fully fine-tuning on limited surgical data can cause overfitting and catastrophic forgetting, compromising model robustness and generalization. Although Low-Rank Adaptation (LoRA) addresses some adaptation issues, its uniform parameter distribution neglects the inherent feature hierarchy, where earlier layers, learning more general features, require more parameters than later ones. To tackle this issue, we introduce Depth Anything in Robotic Endoscopic Surgery (DARES), a novel approach that employs a new adaptation technique, Vector Low-Rank Adaptation (Vector-LoRA) on the DAM V2 to perform self-supervised monocular depth estimation in RAS scenes. To enhance learning efficiency, we introduce Vector-LoRA by integrating more parameters in earlier layers and gradually decreasing parameters in later layers. We also design a reprojection loss based on the multi-scale SSIM error to enhance depth perception by better tailoring the foundation model to the specific requirements of the surgical environment. The proposed method is validated on the SCARED dataset and demonstrates superior performance over recent state-of-the-art self-supervised monocular depth estimation techniques, achieving an improvement of 13.3% in the absolute relative error metric. The code and pre-trained weights are available at https://github.com/mobarakol/DARES.
comment: 11 pages
☆ CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion
With advancements in video generative AI models (e.g., SORA), creators are increasingly using these techniques to enhance video previsualization. However, they face challenges with incomplete and mismatched AI workflows. Existing methods mainly rely on text descriptions and struggle with camera placement, a key component of previsualization. To address these issues, we introduce CinePreGen, a visual previsualization system enhanced with engine-powered diffusion. It features a novel camera and storyboard interface that offers dynamic control, from global to local camera adjustments. This is combined with a user-friendly AI rendering workflow, which aims to achieve consistent results through multi-masked IP-Adapter and engine simulation guidelines. In our comprehensive evaluation study, we demonstrate that our system reduces development viscosity (i.e., the complexity and challenges in the development process), meets users' needs for extensive control and iteration in the design process, and outperforms other AI video production workflows in cinematic camera movement, as shown by our experiments and a within-subjects user study. With its intuitive camera controls and realistic rendering of camera motion, CinePreGen shows great potential for improving video production for both individual creators and industry professionals.
☆ Open-vocabulary Temporal Action Localization using VLMs
Video action localization aims to find timings of a specific action from a long video. Although existing learning-based approaches have been successful, those require annotating videos that come with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging vision-language models (VLM). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess a frame that is considered to be closest to the start/end of the action. Iterating this process by narrowing a sampling time window results in finding a specific frame of start and end of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos.
comment: 7 pages, 5 figures, 4 tables. Last updated on August 30th, 2024
☆ Generative AI Enables Medical Image Segmentation in Ultra Low-Data Regimes
Semantic segmentation of medical images is pivotal in applications like disease diagnosis and treatment planning. While deep learning has excelled in automating this task, a major hurdle is the need for numerous annotated segmentation masks, which are resource-intensive to produce due to the required expertise and time. This scenario often leads to ultra low-data regimes, where annotated images are extremely limited, posing significant challenges for the generalization of conventional deep learning methods on test images. To address this, we introduce a generative deep learning framework, which uniquely generates high-quality paired segmentation masks and medical images, serving as auxiliary data for training robust models in data-scarce environments. Unlike traditional generative models that treat data generation and segmentation model training as separate processes, our method employs multi-level optimization for end-to-end data generation. This approach allows segmentation performance to directly influence the data generation process, ensuring that the generated data is specifically tailored to enhance the performance of the segmentation model. Our method demonstrated strong generalization performance across 9 diverse medical image segmentation tasks and on 16 datasets, in ultra-low data regimes, spanning various diseases, organs, and imaging modalities. When applied to various segmentation models, it achieved performance improvements of 10-20\% (absolute), in both same-domain and out-of-domain scenarios. Notably, it requires 8 to 20 times less training data than existing methods to achieve comparable results. This advancement significantly improves the feasibility and cost-effectiveness of applying deep learning in medical imaging, particularly in scenarios with limited data availability.
☆ How Knowledge Distillation Mitigates the Synthetic Gap in Fair Face Recognition ECCV 2024
Leveraging the capabilities of Knowledge Distillation (KD) strategies, we devise a strategy to fight the recent retraction of face recognition datasets. Given a pretrained Teacher model trained on a real dataset, we show that carefully utilising synthetic datasets, or a mix between real and synthetic datasets to distil knowledge from this teacher to smaller students can yield surprising results. In this sense, we trained 33 different models with and without KD, on different datasets, with different architectures and losses. And our findings are consistent, using KD leads to performance gains across all ethnicities and decreased bias. In addition, it helps to mitigate the performance gap between real and synthetic datasets. This approach addresses the limitations of synthetic data training, improving both the accuracy and fairness of face recognition models.
comment: Accepted at ECCV 2024 Workshops
☆ Look, Learn and Leverage (L$^3$): Mitigating Visual-Domain Shift and Discovering Intrinsic Relations via Symbolic Alignment
Modern deep learning models have demonstrated outstanding performance on discovering the underlying mechanisms when both visual appearance and intrinsic relations (e.g., causal structure) data are sufficient, such as Disentangled Representation Learning (DRL), Causal Representation Learning (CRL) and Visual Question Answering (VQA) methods. However, generalization ability of these models is challenged when the visual domain shifts and the relations data is absent during finetuning. To address this challenge, we propose a novel learning framework, Look, Learn and Leverage (L$^3$), which decomposes the learning process into three distinct phases and systematically utilize the class-agnostic segmentation masks as the common symbolic space to align visual domains. Thus, a relations discovery model can be trained on the source domain, and when the visual domain shifts and the intrinsic relations are absent, the pretrained relations discovery model can be directly reused and maintain a satisfactory performance. Extensive performance evaluations are conducted on three different tasks: DRL, CRL and VQA, and show outstanding results on all three tasks, which reveals the advantages of L$^3$.
comment: 17 pages, 9 figures, 6 tables
☆ LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation
Conventional medical image segmentation methods have been found inadequate in facilitating physicians with the identification of specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions. Due to the varying object scales in medical images, MIRS demands robust vision-language modeling and comprehensive multi-scale interaction for precise localization and segmentation under linguistic guidance. However, existing medical image segmentation methods fall short in meeting these demands, resulting in insufficient segmentation accuracy. In response, we propose an approach named Language-guided Scale-aware MedSegmentor (LSMS), incorporating two appealing designs: (1)~a Scale-aware Vision-Language Attention module that leverages diverse convolutional kernels to acquire rich visual knowledge and interact closely with linguistic features, thereby enhancing lesion localization capability; (2)~a Full-Scale Decoder that globally models multi-modal features across various scales, capturing complementary information between scales to accurately outline lesion boundaries. Addressing the lack of suitable datasets for MIRS, we constructed a vision-language medical dataset called Reference Hepatic Lesion Segmentation (RefHL-Seg). This dataset comprises 2,283 abdominal CT slices from 231 cases, with corresponding textual annotations and segmentation masks for various liver lesions in images. We validated the performance of LSMS for MIRS and conventional medical image segmentation tasks across various datasets. Our LSMS consistently outperforms on all datasets with lower computational costs. The code and datasets will be released.
comment: 14 pages, 5 figures
☆ Enhancing Underwater Imaging with 4-D Light Fields: Dataset and Method
In this paper, we delve into the realm of 4-D light fields (LFs) to enhance underwater imaging plagued by light absorption, scattering, and other challenges. Contrasting with conventional 2-D RGB imaging, 4-D LF imaging excels in capturing scenes from multiple perspectives, thereby indirectly embedding geometric information. This intrinsic property is anticipated to effectively address the challenges associated with underwater imaging. By leveraging both explicit and implicit depth cues present in 4-D LF images, we propose a progressive, mutually reinforcing framework for underwater 4-D LF image enhancement and depth estimation. Specifically, our framework explicitly utilizes estimated depth information alongside implicit depth-related dynamic convolutional kernels to modulate output features. The entire framework decomposes this complex task, iteratively optimizing the enhanced image and depth information to progressively achieve optimal enhancement results. More importantly, we construct the first 4-D LF-based underwater image dataset for quantitative evaluation and supervised training of learning-based methods, comprising 75 underwater scenes and 3675 high-resolution 2K pairs. To craft vibrant and varied underwater scenes, we build underwater environments with various objects and adopt several types of degradation. Through extensive experimentation, we showcase the potential and superiority of 4-D LF-based underwater imaging vis-a-vis traditional 2-D RGB-based approaches. Moreover, our method effectively corrects color bias and achieves state-of-the-art performance. The dataset and code will be publicly available at https://github.com/linlos1234/LFUIE.
comment: 14 pages, 14 figures
☆ Evaluating Reliability in Medical DNNs: A Critical Analysis of Feature and Confidence-Based OOD Detection MICCAI 2023
Reliable use of deep neural networks (DNNs) for medical image analysis requires methods to identify inputs that differ significantly from the training data, called out-of-distribution (OOD), to prevent erroneous predictions. OOD detection methods can be categorised as either confidence-based (using the model's output layer for OOD detection) or feature-based (not using the output layer). We created two new OOD benchmarks by dividing the D7P (dermatology) and BreastMNIST (ultrasound) datasets into subsets which either contain or don't contain an artefact (rulers or annotations respectively). Models were trained with artefact-free images, and images with the artefacts were used as OOD test sets. For each OOD image, we created a counterfactual by manually removing the artefact via image processing, to assess the artefact's impact on the model's predictions. We show that OOD artefacts can boost a model's softmax confidence in its predictions, due to correlations in training data among other factors. This contradicts the common assumption that OOD artefacts should lead to more uncertain outputs, an assumption on which most confidence-based methods rely. We use this to explain why feature-based methods (e.g. Mahalanobis score) typically have greater OOD detection performance than confidence-based methods (e.g. MCP). However, we also show that feature-based methods typically perform worse at distinguishing between inputs that lead to correct and incorrect predictions (for both OOD and ID data). Following from these insights, we argue that a combination of feature-based and confidence-based methods should be used within DNN pipelines to mitigate their respective weaknesses. These project's code and OOD benchmarks are available at: https://github.com/HarryAnthony/Evaluating_OOD_detection.
comment: Accepted for the Uncertainty for Safe Utilization of Machine Learning in Medical Imaging (UNSURE 2024) workshop at the MICCAI 2023
☆ Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Though there are many interpretability methods, many look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate the effectiveness in language models and vision transformers through various methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term 'peak ablation'. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods, with resampling usually causing the most significant performance deterioration. We make our code available at https://github.com/nickypro/investigating-ablation.
comment: 9 pages, 2 figures, XAI World Conference 2024 Late-Breaking Work
☆ Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations
Advancing Machine Learning (ML)-based perception models for autonomous systems necessitates addressing weak spots within the models, particularly in challenging Operational Design Domains (ODDs). These are environmental operating conditions of an autonomous vehicle which can contain difficult conditions, e.g., lens flare at night or objects reflected in a wet street. This report introduces a novel methodology for training with augmentations to enhance model robustness and performance in such conditions. The proposed approach leverages customized physics-based augmentation functions, to generate realistic training data that simulates diverse ODD scenarios. We present a comprehensive framework that includes identifying weak spots in ML models, selecting suitable augmentations, and devising effective training strategies. The methodology integrates hyperparameter optimization and latent space optimization to fine-tune augmentation parameters, ensuring they maximally improve the ML models' performance. Experimental results demonstrate improvements in model performance, as measured by commonly used metrics such as mean Average Precision (mAP) and mean Intersection over Union (mIoU) on open-source object detection and semantic segmentation models and datasets. Our findings emphasize that optimal training strategies are model- and data-specific and highlight the benefits of integrating augmentations into the training pipeline. By incorporating augmentations, we observe enhanced robustness of ML-based perception models, making them more resilient to edge cases encountered in real-world ODDs. This work underlines the importance of customized augmentations and offers an effective solution for improving the safety and reliability of autonomous driving functions.
☆ BOP-D: Revisiting 6D Pose Estimation Benchmark for Better Evaluation under Visual Ambiguities
Currently, 6D pose estimation methods are benchmarked on datasets that consider, for their ground truth annotations, visual ambiguities as only related to global object symmetries. However, as previously observed [26], visual ambiguities can also happen depending on the viewpoint or the presence of occluding objects, when disambiguating parts become hidden. The visual ambiguities are therefore actually different across images. We thus first propose an automatic method to re-annotate those datasets with a 6D pose distribution specific to each image, taking into account the visibility of the object surface in the image to correctly determine the visual ambiguities. Given this improved ground truth, we re-evaluate the state-of-the-art methods and show this greatly modify the ranking of these methods. Our annotations also allow us to benchmark recent methods able to estimate a pose distribution on real images for the first time. We will make our annotations for the T-LESS dataset and our code publicly available.
☆ DCUDF2: Improving Efficiency and Accuracy in Extracting Zero Level Sets from Unsigned Distance Fields
Unsigned distance fields (UDFs) allow for the representation of models with complex topologies, but extracting accurate zero level sets from these fields poses significant challenges, particularly in preserving topological accuracy and capturing fine geometric details. To overcome these issues, we introduce DCUDF2, an enhancement over DCUDF--the current state-of-the-art method--for extracting zero level sets from UDFs. Our approach utilizes an accuracy-aware loss function, enhanced with self-adaptive weights, to improve geometric quality significantly. We also propose a topology correction strategy that reduces the dependence on hyper-parameter, increasing the robustness of our method. Furthermore, we develop new operations leveraging self-adaptive weights to boost runtime efficiency. Extensive experiments on surface extraction across diverse datasets demonstrate that DCUDF2 outperforms DCUDF and existing methods in both geometric fidelity and topological accuracy. We will make the source code publicly available.
☆ UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these issues, we present UrBench, a comprehensive benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level that cover 4 task dimensions: Geo-Localization, Scene Reasoning, Scene Understanding, and Object Understanding, totaling 14 task types. In constructing UrBench, we utilize data from existing datasets and additionally collect data from 11 cities, creating new annotations using a cross-view detection-matching method. With these images and annotations, we then integrate LMM-based, rule-based, and human-based methods to construct large-scale high-quality questions. Our evaluations on 21 LMMs show that current LMMs struggle in the urban environments in several aspects. Even the best performing GPT-4o lags behind humans in most tasks, ranging from simple tasks such as counting to complex tasks such as orientation, localization and object attribute recognition, with an average performance gap of 17.4%. Our benchmark also reveals that LMMs exhibit inconsistent behaviors with different urban views, especially with respect to understanding cross-view relations. UrBench datasets and benchmark results will be publicly available at https://opendatalab.github.io/UrBench/.
comment: 9 pages, 6 figures
☆ VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either fine-tune large language models (LLMs) or build large-scale time-series datasets to develop TSF foundation models. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. In this paper, we explore a new road to building a TSF foundation model from rich and high-quality natural images, based on the intrinsic similarities between images and time series. To bridge the gap between the two domains, we reformulate the TSF task as an image reconstruction task, which is further processed by a visual masked autoencoder (MAE) self-supervised pre-trained on the ImageNet dataset. Surprisingly, without further adaptation in the time-series domain, the proposed VisionTS could achieve superior zero-shot forecasting performance compared to existing TSF foundation models. With minimal fine-tuning, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. These findings suggest that visual models could be a free lunch for TSF and highlight the potential for future cross-domain research between computer vision and TSF. Our code is publicly available at https://github.com/Keytoyze/VisionTS.
comment: 26 pages, 11 figures
☆ Abstracted Gaussian Prototypes for One-Shot Concept Learning
We introduce a cluster-based generative image segmentation framework to encode higher-level representations of visual concepts based on one-shot learning inspired by the Omniglot Challenge. The inferred parameters of each component of a Gaussian Mixture Model (GMM) represent a distinct topological subpart of a visual concept. Sampling new data from these parameters generates augmented subparts to build a more robust prototype for each concept, i.e., the Abstracted Gaussian Prototype (AGP). This framework addresses one-shot classification tasks using a cognitively-inspired similarity metric and addresses one-shot generative tasks through a novel AGP-VAE pipeline employing variational autoencoders (VAEs) to generate new class variants. Results from human judges reveal that the generative pipeline produces novel examples and classes of visual concepts that are broadly indistinguishable from those made by humans. The proposed framework leads to impressive but not state-of-the-art classification accuracy; thus, the contribution is two-fold: 1) the system is uniquely low in theoretical and computational complexity and operates in a completely standalone manner compared while existing approaches draw heavily on pre-training or knowledge engineering; and 2) in contrast with competing neural network models, the AGP approach addresses the importance of breadth of task capability emphasized in the Omniglot challenge (i.e., successful performance on generative tasks). These two points are critical as we advance toward an understanding of how learning/reasoning systems can produce viable, robust, and flexible concepts based on literally nothing more than a single example.
☆ A nonlinear elasticity model in computer vision
The purpose of this paper is to analyze a nonlinear elasticity model previously introduced by the authors for comparing two images, regarded as bounded open subsets of $\R^n$ together with associated vector-valued intensity maps. Optimal transformations between the images are sought as minimisers of an integral functional among orientation-preserving homeomorphisms. The existence of minimisers is proved under natural coercivity and polyconvexity conditions, assuming only that the intensity functions are bounded measurable. Variants of the existence theorem are also proved, first under the constraint that finite sets of landmark points in the two images are mapped one to the other, and second when one image is to be compared to an unknown part of another. The question is studied as to whether for images related by a linear mapping the unique minimizer is given by that linear mapping. For a natural class of functional integrands an example is given guaranteeing that this property holds for pairs of images in which the second is a scaling of the first by a constant factor. However for the property to hold for arbitrary pairs of linearly related images it is shown that the integrand has to depend on the gradient of the transformation as a convex function of its determinant alone. This suggests a new model in which the integrand depends also on second derivatives of the transformation, and an example is given for which both existence of minimizers is assured and the above property holds for all pairs of linearly related images.
☆ CondSeg: Ellipse Estimation of Pupil and Iris via Conditioned Segmentation
Parsing of eye components (i.e. pupil, iris and sclera) is fundamental for eye tracking and gaze estimation for AR/VR products. Mainstream approaches tackle this problem as a multi-class segmentation task, providing only visible part of pupil/iris, other methods regress elliptical parameters using human-annotated full pupil/iris parameters. In this paper, we consider two priors: projected full pupil/iris circle can be modelled with ellipses (ellipse prior), and the visibility of pupil/iris is controlled by openness of eye-region (condition prior), and design a novel method CondSeg to estimate elliptical parameters of pupil/iris directly from segmentation labels, without explicitly annotating full ellipses, and use eye-region mask to control the visibility of estimated pupil/iris ellipses. Conditioned segmentation loss is used to optimize the parameters by transforming parameterized ellipses into pixel-wise soft masks in a differentiable way. Our method is tested on public datasets (OpenEDS-2019/-2020) and shows competitive results on segmentation metrics, and provides accurate elliptical parameters for further applications of eye tracking simultaneously.
☆ OG-Mapping: Octree-based Structured 3D Gaussians for Online Dense Mapping
3D Gaussian splatting (3DGS) has recently demonstrated promising advancements in RGB-D online dense mapping. Nevertheless, existing methods excessively rely on per-pixel depth cues to perform map densification, which leads to significant redundancy and increased sensitivity to depth noise. Additionally, explicitly storing 3D Gaussian parameters of room-scale scene poses a significant storage challenge. In this paper, we introduce OG-Mapping, which leverages the robust scene structural representation capability of sparse octrees, combined with structured 3D Gaussian representations, to achieve efficient and robust online dense mapping. Moreover, OG-Mapping employs an anchor-based progressive map refinement strategy to recover the scene structures at multiple levels of detail. Instead of maintaining a small number of active keyframes with a fixed keyframe window as previous approaches do, a dynamic keyframe window is employed to allow OG-Mapping to better tackle false local minima and forgetting issues. Experimental results demonstrate that OG-Mapping delivers more robust and superior realism mapping results than existing Gaussian-based RGB-D online mapping methods with a compact model, and no additional post-processing is required.
☆ How Could Generative AI Support Compliance with the EU AI Act? A Review for Safe Automated Driving Perception
Deep Neural Networks (DNNs) have become central for the perception functions of autonomous vehicles, substantially enhancing their ability to understand and interpret the environment. However, these systems exhibit inherent limitations such as brittleness, opacity, and unpredictable behavior in out-of-distribution scenarios. The European Union (EU) Artificial Intelligence (AI) Act, as a pioneering legislative framework, aims to address these challenges by establishing stringent norms and standards for AI systems, including those used in autonomous driving (AD), which are categorized as high-risk AI. In this work, we explore how the newly available generative AI models can potentially support addressing upcoming regulatory requirements in AD perception, particularly with respect to safety. This short review paper summarizes the requirements arising from the EU AI Act regarding DNN-based perception systems and systematically categorizes existing generative AI applications in AD. While generative AI models show promise in addressing some of the EU AI Acts requirements, such as transparency and robustness, this review examines their potential benefits and discusses how developers could leverage these methods to enhance compliance with the Act. The paper also highlights areas where further research is needed to ensure reliable and safe integration of these technologies.
☆ NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar
Recently, visual grounding and multi-sensors setting have been incorporated into perception system for terrestrial autonomous driving systems and Unmanned Surface Vehicles (USVs), yet the high complexity of modern learning-based visual grounding model using multi-sensors prevents such model to be deployed on USVs in the real-life. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both camera and 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform both box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments and boasts ultra-low power consumption for long endurance.
comment: 8 pages, 6 figures
☆ Covariance-corrected Whitening Alleviates Network Degeneration on Imbalanced Classification
Class imbalance is a critical issue in image classification that significantly affects the performance of deep recognition models. In this work, we first identify a network degeneration dilemma that hinders the model learning by introducing a high linear dependence among the features inputted into the classifier. To overcome this challenge, we propose a novel framework called Whitening-Net to mitigate the degenerate solutions, in which ZCA whitening is integrated before the linear classifier to normalize and decorrelate the batch samples. However, in scenarios with extreme class imbalance, the batch covariance statistic exhibits significant fluctuations, impeding the convergence of the whitening operation. Therefore, we propose two covariance-corrected modules, the Group-based Relatively Balanced Batch Sampler (GRBS) and the Batch Embedded Training (BET), to get more accurate and stable batch covariance, thereby reinforcing the capability of whitening. Our modules can be trained end-to-end without incurring substantial computational costs. Comprehensive empirical evaluations conducted on benchmark datasets, including CIFAR-LT-10/100, ImageNet-LT, and iNaturalist-LT, validate the effectiveness of our proposed approaches.
comment: 20 pages, 10 figures, 10 tables. arXiv admin note: text overlap with arXiv:2112.05958
☆ Hybrid Classification-Regression Adaptive Loss for Dense Object Detection
For object detection detectors, enhancing model performance hinges on the ability to simultaneously consider inconsistencies across tasks and focus on difficult-to-train samples. Achieving this necessitates incorporating information from both the classification and regression tasks. However, prior work tends to either emphasize difficult-to-train samples within their respective tasks or simply compute classification scores with IoU, often leading to suboptimal model performance. In this paper, we propose a Hybrid Classification-Regression Adaptive Loss, termed as HCRAL. Specifically, we introduce the Residual of Classification and IoU (RCI) module for cross-task supervision, addressing task inconsistencies, and the Conditioning Factor (CF) to focus on difficult-to-train samples within each task. Furthermore, we introduce a new strategy named Expanded Adaptive Training Sample Selection (EATSS) to provide additional samples that exhibit classification and regression inconsistencies. To validate the effectiveness of the proposed method, we conduct extensive experiments on COCO test-dev. Experimental evaluations demonstrate the superiority of our approachs. Additionally, we designed experiments by separately combining the classification and regression loss with regular loss functions in popular one-stage models, demonstrating improved performance.
☆ EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs
Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome the barrier, we propose EMHI, a multimodal \textbf{E}gocentric human \textbf{M}otion dataset with \textbf{H}ead-Mounted Display (HMD) and body-worn \textbf{I}MUs, with all data collected under the real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. This dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. We believe the release of EMHI and the method could advance the research of egocentric HPE and expedite the practical implementation of this technology in VR/AR products.
Self-supervised Anomaly Detection Pretraining Enhances Long-tail ECG Diagnosis
Current computer-aided ECG diagnostic systems struggle with the underdetection of rare but critical cardiac anomalies due to the imbalanced nature of ECG datasets. This study introduces a novel approach using self-supervised anomaly detection pretraining to address this limitation. The anomaly detection model is specifically designed to detect and localize subtle deviations from normal cardiac patterns, capturing the nuanced details essential for accurate ECG interpretation. Validated on an extensive dataset of over one million ECG records from clinical practice, characterized by a long-tail distribution across 116 distinct categories, the anomaly detection-pretrained ECG diagnostic model has demonstrated a significant improvement in overall accuracy. Notably, our approach yielded a 94.7% AUROC, 92.2% sensitivity, and 92.5\% specificity for rare ECG types, significantly outperforming traditional methods and narrowing the performance gap with common ECG types. The integration of anomaly detection pretraining into ECG analysis represents a substantial contribution to the field, addressing the long-standing challenge of long-tail data distributions in clinical diagnostics. Furthermore, prospective validation in real-world clinical settings revealed that our AI-driven approach enhances diagnostic efficiency, precision, and completeness by 32%, 6.7%, and 11.8% respectively, when compared to standard practices. This advancement marks a pivotal step forward in the integration of AI within clinical cardiology, with particularly profound implications for emergency care, where rapid and accurate ECG interpretation is crucial. The contributions of this study not only push the boundaries of current ECG diagnostic capabilities but also lay the groundwork for more reliable and accessible cardiovascular care.
comment: arXiv admin note: text overlap with arXiv:2404.04935
☆ Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning
Recently, Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension. However, they still suffer from hallucination problems referring to generating inconsistent outputs with the image content. To mitigate hallucinations, previous studies mainly focus on retraining LVLMs with custom datasets. Although effective, they inherently come with additional computational costs. In this paper, we propose a training-free framework, \textbf{MVP}, that aims to reduce hallucinations by making the most of the innate capabilities of the LVLMs via \textbf{M}ulti-\textbf{V}iew Multi-\textbf{P}ath Reasoning. Specifically, we first devise a multi-view information-seeking strategy to thoroughly perceive the comprehensive information in the image, which enriches the general global information captured by the original vision encoder in LVLMs. Furthermore, during the answer decoding, we observe that the occurrence of hallucinations has a strong correlation with the certainty of the answer tokens. Thus, we propose multi-path reasoning for each information view to quantify and aggregate the certainty scores for each potential answer among multiple decoding paths and finally decide the output answer. By fully grasping the information in the image and carefully considering the certainty of the potential answers when decoding, our MVP can effectively reduce hallucinations in LVLMs.The extensive experiments verify that our proposed MVP significantly mitigates the hallucination problem across four well-known LVLMs. The source code is available at: \url{https://github.com/GasolSun36/MVP}.
comment: 13 pages, 7 tables, 7 figures
☆ GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring ECCV 2024
The extraction of keypoints in images is at the basis of many computer vision applications, from localization to 3D reconstruction. Keypoints come with a score permitting to rank them according to their quality. While learned keypoints often exhibit better properties than handcrafted ones, their scores are not easily interpretable, making it virtually impossible to compare the quality of individual keypoints across methods. We propose a framework that can refine, and at the same time characterize with an interpretable score, the keypoints extracted by any method. Our approach leverages a modified robust Gaussian Mixture Model fit designed to both reject non-robust keypoints and refine the remaining ones. Our score comprises two components: one relates to the probability of extracting the same keypoint in an image captured from another viewpoint, the other relates to the localization accuracy of the keypoint. These two interpretable components permit a comparison of individual keypoints extracted across different methods. Through extensive experiments we demonstrate that, when applied to popular keypoint detectors, our framework consistently improves the repeatability of keypoints as well as their performance in homography and two/multiple-view pose recovery tasks.
comment: Accepted at ECCV 2024
☆ RenDetNet: Weakly-supervised Shadow Detection with Shadow Caster Verification ECCV 2024
Existing shadow detection models struggle to differentiate dark image areas from shadows. In this paper, we tackle this issue by verifying that all detected shadows are real, i.e. they have paired shadow casters. We perform this step in a physically-accurate manner by differentiably re-rendering the scene and observing the changes stemming from carving out estimated shadow casters. Thanks to this approach, the RenDetNet proposed in this paper is the first learning-based shadow detection model whose supervisory signals can be computed in a self-supervised manner. The developed system compares favourably against recent models trained on our data. As part of this publication, we release our code on github.
comment: AIM @ ECCV 2024 / code available at https://github.com/n-kubiak/RenDetNet
☆ Temporal and Interactive Modeling for Efficient Human-Human Motion Generation
Human-human motion generation is essential for understanding humans as social beings. Although several transformer-based methods have been proposed, they typically model each individual separately and overlook the causal relationships in temporal motion sequences. Furthermore, the attention mechanism in transformers exhibits quadratic computational complexity, significantly reducing their efficiency when processing long sequences. In this paper, we introduce TIM (Temporal and Interactive Modeling), an efficient and effective approach that presents the pioneering human-human motion generation model utilizing RWKV. Specifically, we first propose Causal Interactive Injection to leverage the temporal properties of motion sequences and avoid non-causal and cumbersome modeling. Then we present Role-Evolving Mixing to adjust to the ever-evolving roles throughout the interaction. Finally, to generate smoother and more rational motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on InterHuman demonstrate that our method achieves superior performance. Notably, TIM has achieved state-of-the-art results using only 32% of InterGen's trainable parameters. Code will be available soon. Homepage: https://aigc-explorer.github.io/TIM-page/
comment: Homepage: https://aigc-explorer.github.io/TIM-page/
☆ VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
The Diffusion Transformers Models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weight into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We found that traditional VQ methods calibrate only the codebook without calibrating the assignments. This leads to weight sub-vectors being incorrectly assigned to the same assignment, providing inconsistent gradients to the codebook and resulting in a suboptimal result. To address this challenge, VQ4DiT calculates the candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector based on the weighted average. Then, using the zero-data and block-wise calibration method, the optimal assignment from the set is efficiently selected while calibrating the codebook. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours depending on the different quantization settings. Experiments show that VQ4DiT establishes a new state-of-the-art in model size and performance trade-offs, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
comment: 11 pages, 6 figures
☆ Multi-centric AI Model for Unruptured Intracranial Aneurysm Detection and Volumetric Segmentation in 3D TOF-MRI
Purpose: To develop an open-source nnU-Net-based AI model for combined detection and segmentation of unruptured intracranial aneurysms (UICA) in 3D TOF-MRI, and compare models trained on datasets with aneurysm-like differential diagnoses. Methods: This retrospective study (2020-2023) included 385 anonymized 3D TOF-MRI images from 364 patients (mean age 59 years, 60% female) at multiple centers plus 113 subjects from the ADAM challenge. Images featured untreated or possible UICAs and differential diagnoses. Four distinct training datasets were created, and the nnU-Net framework was used for model development. Performance was assessed on a separate test set using sensitivity and False Positive (FP)/case rate for detection, and DICE score and NSD (Normalized Surface Distance) with a 0.5mm threshold for segmentation. Statistical analysis included chi-square, Mann-Whitney-U, and Kruskal-Wallis tests, with significance set at p < 0.05. Results: Models achieved overall sensitivity between 82% and 85% and a FP/case rate of 0.20 to 0.31, with no significant differences (p = 0.90 and p = 0.16). The primary model showed 85% sensitivity and 0.23 FP/case rate, outperforming the ADAM-challenge winner (61%) and a nnU-Net trained on ADAM data (51%) in sensitivity (p < 0.05). It achieved a mean DICE score of 0.73 and an NSD of 0.84 for correctly detected UICA. Conclusions: Our open-source, nnU-Net-based AI model (available at 10.5281/zenodo.13386859) demonstrates high sensitivity, low false positive rates, and consistent segmentation accuracy for UICA detection and segmentation in 3D TOF-MRI, suggesting its potential to improve clinical diagnosis and for monitoring of UICA.
comment: 14 pages, 5 figures, 3 tables, 2 supplementary tables
☆ Sparse Uncertainty-Informed Sampling from Federated Streaming Data
We present a numerically robust, computationally efficient approach for non-I.I.D. data stream sampling in federated client systems, where resources are limited and labeled data for local model adaptation is sparse and expensive. The proposed method identifies relevant stream observations to optimize the underlying client model, given a local labeling budget, and performs instantaneous labeling decisions without relying on any memory buffering strategies. Our experiments show enhanced training batch diversity and an improved numerical robustness of the proposal compared to existing strategies over large-scale data streams, making our approach an effective and convenient solution in FL environments.
comment: Preprint, 6 pages, 3 figures, Accepted for ESANN 2024
☆ UTrack: Multi-Object Tracking with Uncertain Detections ECCV 2024
The tracking-by-detection paradigm is the mainstream in multi-object tracking, associating tracks to the predictions of an object detector. Although exhibiting uncertainty through a confidence score, these predictions do not capture the entire variability of the inference process. For safety and security critical applications like autonomous driving, surveillance, etc., knowing this predictive uncertainty is essential though. Therefore, we introduce, for the first time, a fast way to obtain the empirical predictive distribution during object detection and incorporate that knowledge in multi-object tracking. Our mechanism can easily be integrated into state-of-the-art trackers, enabling them to fully exploit the uncertainty in the detections. Additionally, novel association methods are introduced that leverage the proposed mechanism. We demonstrate the effectiveness of our contribution on a variety of benchmarks, such as MOT17, MOT20, DanceTrack, and KITTI.
comment: Accepted for the ECCV 2024 Workshop on Uncertainty Quantification for Computer Vision
☆ RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance
Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constraint devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate the solution of the coherence problem through the proposed approach by reporting substantive experiments to demonstrate our approach's effectiveness in compact model size and excellent generation quality.
☆ FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition
Federated learning is a machine learning paradigm that enables decentralized clients to collaboratively learn a shared model while keeping all the training data local. While considerable research has focused on federated image generation, particularly Generative Adversarial Networks, Variational Autoencoders have received less attention. In this paper, we address the challenges of non-IID (independently and identically distributed) data environments featuring multiple groups of images of different types. Specifically, heterogeneous data distributions can lead to difficulties in maintaining a consistent latent space and can also result in local generators with disparate texture features being blended during aggregation. We introduce a novel approach, FissionVAE, which decomposes the latent space and constructs decoder branches tailored to individual client groups. This method allows for customized learning that aligns with the unique data distributions of each group. Additionally, we investigate the incorporation of hierarchical VAE architectures and demonstrate the use of heterogeneous decoder architectures within our model. We also explore strategies for setting the latent prior distributions to enhance the decomposition process. To evaluate our approach, we assemble two composite datasets: the first combines MNIST and FashionMNIST; the second comprises RGB datasets of cartoon and human faces, wild animals, marine vessels, and remote sensing images of Earth. Our experiments demonstrate that FissionVAE greatly improves generation quality on these datasets compared to baseline federated VAE models.
☆ Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning
To transfer knowledge from seen attribute-object compositions to recognize unseen ones, recent compositional zero-shot learning (CZSL) methods mainly discuss the optimal classification branches to identify the elements, leading to the popularity of employing a three-branch architecture. However, these methods mix up the underlying relationship among the branches, in the aspect of consistency and diversity. Specifically, consistently providing the highest-level features for all three branches increases the difficulty in distinguishing classes that are superficially similar. Furthermore, a single branch may focus on suboptimal regions when spatial messages are not shared between the personalized branches. Recognizing these issues and endeavoring to address them, we propose a novel method called Focus-Consistent Multi-Level Aggregation (FOMA). Our method incorporates a Multi-Level Feature Aggregation (MFA) module to generate personalized features for each branch based on the image content. Additionally, a Focus-Consistent Constraint encourages a consistent focus on the informative regions, thereby implicitly exchanging spatial information between all branches. Extensive experiments on three benchmark datasets (UT-Zappos, C-GQA, and Clothing16K) demonstrate that our FOMA outperforms SOTA.
comment: Compositional Zero-Shot Learning
☆ Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training
Recent Vision Mamba models not only have much lower complexity for processing higher resolution images and longer videos but also the competitive performance with Vision Transformers (ViTs). However, they are stuck into overfitting and thus only present up to base size (about 80M). It is still unclear how vanilla Vision Mamba (Vim) can be efficiently scaled up to larger sizes, which is essentially for further exploitation. In this paper, we propose a stochastic layer-wise shuffle regularization, which empowers successfully scaling non-hierarchical Vision Mamba to a large size (about 300M) in a supervised setting. Specifically, our base and large-scale ShuffleMamba models can outperform the supervised ViTs of similar size by 0.8\% and 1.0\% classification accuracy on ImageNet1k, respectively, without auxiliary data. When evaluated on the ADE20K semantic segmentation and COCO detection tasks, our ShuffleMamba models also show significant improvements. Without bells and whistles, the stochastic layer-wise shuffle has the following highlights: (1) \textit{Plug and play:} it does not change model architectures and will be omitted in inference. (2) \textit{Simple but effective:} it can improve the overfitting in Vim training and only introduce random token permutation operations. (3) \textit{Intuitive:} the token sequences in deeper layers are more likely to be shuffled as they are expected to be more semantic and less sensitive to patch positions. Code and models will be available at https://github.com/huangzizheng01/ShuffleMamba.
☆ Approximately Invertible Neural Network for Learned Image Compression
Learned image compression have attracted considerable interests in recent years. It typically comprises an analysis transform, a synthesis transform, quantization and an entropy coding model. The analysis transform and synthesis transform are used to encode an image to latent feature and decode the quantized feature to reconstruct the image, and can be regarded as coupled transforms. However, the analysis transform and synthesis transform are designed independently in the existing methods, making them unreliable in high-quality image compression. Inspired by the invertible neural networks in generative modeling, invertible modules are used to construct the coupled analysis and synthesis transforms. Considering the noise introduced in the feature quantization invalidates the invertible process, this paper proposes an Approximately Invertible Neural Network (A-INN) framework for learned image compression. It formulates the rate-distortion optimization in lossy image compression when using INN with quantization, which differentiates from using INN for generative modelling. Generally speaking, A-INN can be used as the theoretical foundation for any INN based lossy compression method. Based on this formulation, A-INN with a progressive denoising module (PDM) is developed to effectively reduce the quantization noise in the decoding. Moreover, a Cascaded Feature Recovery Module (CFRM) is designed to learn high-dimensional feature recovery from low-dimensional ones to further reduce the noise in feature channel compression. In addition, a Frequency-enhanced Decomposition and Synthesis Module (FDSM) is developed by explicitly enhancing the high-frequency components in an image to address the loss of high-frequency information inherent in neural network based image compression. Extensive experiments demonstrate that the proposed A-INN outperforms the existing learned image compression methods.
☆ Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning
Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we tackle efficiency without compromising accuracy? This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective in video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we then propose a novel Video-level Blending data (VB), where VB is implemented by blending the original image and its warped version frame-by-frame, serving as a hard negative sample to mine more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) to equip a pretrained image model (both ViTs and CNNs) with the ability to capture both spatial and temporal features jointly and efficiently. StA is designed with two-stream 3D-Conv with varying kernel sizes, allowing it to process spatial and temporal features separately. Extensive experiments validate the effectiveness of the proposed methods; and show our approach can generalize well to previously unseen forgery videos, even the just-released (in 2024) SoTAs. We release our code and pretrained weights at \url{https://github.com/YZY-stack/StA4Deepfake}.
☆ Instant Adversarial Purification with Adversarial Consistency Distillation
Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve defense success rate of 74.19\% on ImageNet, only requiring 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that the GAND does not need a Full Fine Tune (FFT); PEFT, e.g., LoRA is sufficient.
☆ Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer
Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote\&Mix (\textbf{VoMix}), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models \textit{without any training}. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2$\times$ increase in throughput of existing ViT-H on ImageNet-1K and a 2.4$\times$ increase in throughput of existing ViT-L on Kinetics-400 video dataset, with a mere 0.3\% drop in top-1 accuracy.
☆ Efficient Image Restoration through Low-Rank Adaptation and Stable Diffusion XL
In this study, we propose an enhanced image restoration model, SUPIR, based on the integration of two low-rank adaptive (LoRA) modules with the Stable Diffusion XL (SDXL) framework. Our method leverages the advantages of LoRA to fine-tune SDXL models, thereby significantly improving image restoration quality and efficiency. We collect 2600 high-quality real-world images, each with detailed descriptive text, for training the model. The proposed method is evaluated on standard benchmarks and achieves excellent performance, demonstrated by higher peak signal-to-noise ratio (PSNR), lower learned perceptual image patch similarity (LPIPS), and higher structural similarity index measurement (SSIM) scores. These results underscore the effectiveness of combining LoRA with SDXL for advanced image restoration tasks, highlighting the potential of our approach in generating high-fidelity restored images.
comment: 10 pages
☆ A Survey of the Self Supervised Learning Mechanisms for Vision Transformers
Deep supervised learning models require high volume of labeled data to attain sufficiently good results. Although, the practice of gathering and annotating such big data is costly and laborious. Recently, the application of self supervised learning (SSL) in vision tasks has gained significant attention. The intuition behind SSL is to exploit the synchronous relationships within the data as a form of self-supervision, which can be versatile. In the current big data era, most of the data is unlabeled, and the success of SSL thus relies in finding ways to improve this vast amount of unlabeled data available. Thus its better for deep learning algorithms to reduce reliance on human supervision and instead focus on self-supervision based on the inherent relationships within the data. With the advent of ViTs, which have achieved remarkable results in computer vision, it is crucial to explore and understand the various SSL mechanisms employed for training these models specifically in scenarios where there is less label data available. In this survey we thus develop a comprehensive taxonomy of systematically classifying the SSL techniques based upon their representations and pre-training tasks being applied. Additionally, we discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field. Furthermore, we present a comparative analysis of different SSL methods, evaluate their strengths and limitations, and identify potential avenues for future research.
comment: 34 Pages, 5 Figures, 7 Tables
☆ LAR-IQA: A Lightweight, Accurate, and Robust No-Reference Image Quality Assessment Model
Recent advancements in the field of No-Reference Image Quality Assessment (NR-IQA) using deep learning techniques demonstrate high performance across multiple open-source datasets. However, such models are typically very large and complex making them not so suitable for real-world deployment, especially on resource- and battery-constrained mobile devices. To address this limitation, we propose a compact, lightweight NR-IQA model that achieves state-of-the-art (SOTA) performance on ECCV AIM UHD-IQA challenge validation and test datasets while being also nearly 5.7 times faster than the fastest SOTA model. Our model features a dual-branch architecture, with each branch separately trained on synthetically and authentically distorted images which enhances the model's generalizability across different distortion types. To improve robustness under diverse real-world visual conditions, we additionally incorporate multiple color spaces during the training process. We also demonstrate the higher accuracy of recently proposed Kolmogorov-Arnold Networks (KANs) for final quality regression as compared to the conventional Multi-Layer Perceptrons (MLPs). Our evaluation considering various open-source datasets highlights the practical, high-accuracy, and robust performance of our proposed lightweight model. Code: https://github.com/nasimjamshidi/LAR-IQA.
☆ BTMuda: A Bi-level Multi-source unsupervised domain adaptation framework for breast cancer diagnosis
Deep learning has revolutionized the early detection of breast cancer, resulting in a significant decrease in mortality rates. However, difficulties in obtaining annotations and huge variations in distribution between training sets and real scenes have limited their clinical applications. To address these limitations, unsupervised domain adaptation (UDA) methods have been used to transfer knowledge from one labeled source domain to the unlabeled target domain, yet these approaches suffer from severe domain shift issues and often ignore the potential benefits of leveraging multiple relevant sources in practical applications. To address these limitations, in this work, we construct a Three-Branch Mixed extractor and propose a Bi-level Multi-source unsupervised domain adaptation method called BTMuda for breast cancer diagnosis. Our method addresses the problems of domain shift by dividing domain shift issues into two levels: intra-domain and inter-domain. To reduce the intra-domain shift, we jointly train a CNN and a Transformer as two paths of a domain mixed feature extractor to obtain robust representations rich in both low-level local and high-level global information. As for the inter-domain shift, we redesign the Transformer delicately to a three-branch architecture with cross-attention and distillation, which learns domain-invariant representations from multiple domains. Besides, we introduce two alignment modules - one for feature alignment and one for classifier alignment - to improve the alignment process. Extensive experiments conducted on three public mammographic datasets demonstrate that our BTMuda outperforms state-of-the-art methods.
☆ Can We Leave Deepfake Data Behind in Training Deepfake Detector?
The generalization ability of deepfake detectors is vital for their applications in real-world scenarios. One effective solution to enhance this ability is to train the models with manually-blended data, which we termed "blendfake", encouraging models to learn generic forgery artifacts like blending boundary. Interestingly, current SoTA methods utilize blendfake without incorporating any deepfake data in their training process. This is likely because previous empirical observations suggest that vanilla hybrid training (VHT), which combines deepfake and blendfake data, results in inferior performance to methods using only blendfake data (so-called "1+1<2"). Therefore, a critical question arises: Can we leave deepfake behind and rely solely on blendfake data to train an effective deepfake detector? Intuitively, as deepfakes also contain additional informative forgery clues (e.g., deep generative artifacts), excluding all deepfake data in training deepfake detectors seems counter-intuitive. In this paper, we rethink the role of blendfake in detecting deepfakes and formulate the process from "real to blendfake to deepfake" to be a progressive transition. Specifically, blendfake and deepfake can be explicitly delineated as the oriented pivot anchors between "real-to-fake" transitions. The accumulation of forgery information should be oriented and progressively increasing during this transition process. To this end, we propose an Oriented Progressive Regularizor (OPR) to establish the constraints that compel the distribution of anchors to be discretely arranged. Furthermore, we introduce feature bridging to facilitate the smooth transition between adjacent anchors. Extensive experiments confirm that our design allows leveraging forgery information from both blendfake and deepfake effectively and comprehensively.
☆ Text-to-Image Generation Via Energy-Based CLIP
Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on Cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. EB-CLIP not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of EB-CLIP by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that EB-CLIP can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.
☆ CP-VoteNet: Contrastive Prototypical VoteNet for Few-Shot Point Cloud Object Detection
Few-shot point cloud 3D object detection (FS3D) aims to identify and localise objects of novel classes from point clouds, using knowledge learnt from annotated base classes and novel classes with very few annotations. Thus far, this challenging task has been approached using prototype learning, but the performance remains far from satisfactory. We find that in existing methods, the prototypes are only loosely constrained and lack of fine-grained awareness of the semantic and geometrical correlation embedded within the point cloud space. To mitigate these issues, we propose to leverage the inherent contrastive relationship within the semantic and geometrical subspaces to learn more refined and generalisable prototypical representations. To this end, we first introduce contrastive semantics mining, which enables the network to extract discriminative categorical features by constructing positive and negative pairs within training batches. Meanwhile, since point features representing local patterns can be clustered into geometric components, we further propose to impose contrastive relationship at the primitive level. Through refined primitive geometric structures, the transferability of feature encoding from base to novel classes is significantly enhanced. The above designs and insights lead to our novel Contrastive Prototypical VoteNet (CP-VoteNet). Extensive experiments on two FS3D benchmarks FS-ScanNet and FS-SUNRGBD demonstrate that CP-VoteNet surpasses current state-of-the-art methods by considerable margins across different FS3D settings. Further ablation studies conducted corroborate the rationale and effectiveness of our designs.
comment: Accepted by PRCV 2024
☆ ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images ECCV 2024
To advance the state of the art in the creation of 3D foundation models, this paper introduces the ConDense framework for 3D pre-training utilizing existing pre-trained 2D networks and large-scale multi-view datasets. We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline, where 2D-3D feature consistency is enforced through a volume rendering NeRF-like ray marching process. Using dense per pixel features we are able to 1) directly distill the learned priors from 2D models to 3D models and create useful 3D backbones, 2) extract more consistent and less noisy 2D features, 3) formulate a consistent embedding space where 2D, 3D, and other modalities of data (e.g., natural language prompts) can be jointly queried. Furthermore, besides dense features, ConDense can be trained to extract sparse features (e.g., key points), also with 2D-3D consistency -- condensing 3D NeRF representations into compact sets of decorated key points. We demonstrate that our pre-trained model provides good initialization for various 3D tasks including 3D classification and segmentation, outperforming other 3D pre-training methods by a significant margin. It also enables, by exploiting our sparse features, additional useful downstream tasks, such as matching 2D images to 3D scenes, detecting duplicate 3D scenes, and querying a repository of 3D scenes through natural language -- all quite efficiently and without any per-scene fine-tuning.
comment: ECCV 2024
☆ Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities
Imaging techniques such as Chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection for a wide variety of medical pulmonary and ophthalmic conditions respectively. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families each with pretraining and random initialization. Our finding showed that the use of pretrained models as fixed feature extractors yields poor performance irrespective of the datasets. Contrary, histopathology microscopy whole slide images have better performance. It is also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that the improvements in ImageNet are not parallel to the medical imaging tasks. Within a medical domain, the performance of the network architectures varies within model families with shifts in datasets. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.
comment: 15 pages, 3 figures, 4 tables
☆ Retrieval-Augmented Natural Language Reasoning for Explainable Visual Question Answering ICIP
Visual Question Answering with Natural Language Explanation (VQA-NLE) task is challenging due to its high demand for reasoning-based inference. Recent VQA-NLE studies focus on enhancing model networks to amplify the model's reasoning capability but this approach is resource-consuming and unstable. In this work, we introduce a new VQA-NLE model, ReRe (Retrieval-augmented natural language Reasoning), using leverage retrieval information from the memory to aid in generating accurate answers and persuasive explanations without relying on complex networks and extra datasets. ReRe is an encoder-decoder architecture model using a pre-trained clip vision encoder and a pre-trained GPT-2 language model as a decoder. Cross-attention layers are added in the GPT-2 for processing retrieval features. ReRe outperforms previous methods in VQA accuracy and explanation score and shows improvement in NLE with more persuasive, reliability.
comment: ICIP Workshop 2024
☆ Efficient Camera Exposure Control for Visual Odometry via Deep Reinforcement Learning
The stability of visual odometry (VO) systems is undermined by degraded image quality, especially in environments with significant illumination changes. This study employs a deep reinforcement learning (DRL) framework to train agents for exposure control, aiming to enhance imaging performance in challenging conditions. A lightweight image simulator is developed to facilitate the training process, enabling the diversification of image exposure and sequence trajectory. This setup enables completely offline training, eliminating the need for direct interaction with camera hardware and the real environments. Different levels of reward functions are crafted to enhance the VO systems, equipping the DRL agents with varying intelligence. Extensive experiments have shown that our exposure control agents achieve superior efficiency-with an average inference duration of 1.58 ms per frame on a CPU-and respond more quickly than traditional feedback control schemes. By choosing an appropriate reward function, agents acquire an intelligent understanding of motion trends and anticipate future illumination changes. This predictive capability allows VO systems to deliver more stable and precise odometry results. The codes and datasets are available at https://github.com/ShuyangUni/drl_exposure_ctrl.
comment: 8 pages, 7 figures
☆ AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We hypothesize that the requisite number of visual tokens for the model is contingent upon both the resolution and content of the input image. Generally, natural images with a lower information density can be effectively interpreted by the model using fewer visual tokens at reduced resolutions. In contrast, images containing textual content, such as documents with rich text, necessitate a higher number of visual tokens for accurate text interpretation due to their higher information density. Building on this insight, we devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. This method mitigates distortion effects that arise from resizing images to a uniform resolution and dynamically optimizing the visual tokens input to the LLMs. Our model is capable of processing images with resolutions up to $1008\times 1008$. Extensive experiments across various datasets demonstrate that our method achieves impressive performance in handling vision-language tasks in both natural and text-related scenes. The source code and dataset are now publicly available at \url{https://github.com/harrytea/AdaptVision}.
☆ 2DGH: 2D Gaussian-Hermite Splatting for High-quality Rendering and Better Geometry Reconstruction
2D Gaussian Splatting has recently emerged as a significant method in 3D reconstruction, enabling novel view synthesis and geometry reconstruction simultaneously. While the well-known Gaussian kernel is broadly used, its lack of anisotropy and deformation ability leads to dim and vague edges at object silhouettes, limiting the reconstruction quality of current Gaussian splatting methods. To enhance the representation power, we draw inspiration from quantum physics and propose to use the Gaussian-Hermite kernel as the new primitive in Gaussian splatting. The new kernel takes a unified mathematical form and extends the Gaussian function, which serves as the zero-rank term in the updated formulation. Our experiments demonstrate the extraordinary performance of Gaussian-Hermite kernel in both geometry reconstruction and novel-view synthesis tasks. The proposed kernel outperforms traditional Gaussian Splatting kernels, showcasing its potential for high-quality 3D reconstruction and rendering.
☆ Cross Fusion RGB-T Tracking with Bi-directional Adapter
Many state-of-the-art RGB-T trackers have achieved remarkable results through modality fusion. However, these trackers often either overlook temporal information or fail to fully utilize it, resulting in an ineffective balance between multi-modal and temporal information. To address this issue, we propose a novel Cross Fusion RGB-T Tracking architecture (CFBT) that ensures the full participation of multiple modalities in tracking while dynamically fusing temporal information. The effectiveness of CFBT relies on three newly designed cross spatio-temporal information fusion modules: Cross Spatio-Temporal Augmentation Fusion (CSTAF), Cross Spatio-Temporal Complementarity Fusion (CSTCF), and Dual-Stream Spatio-Temporal Adapter (DSTA). CSTAF employs a cross-attention mechanism to enhance the feature representation of the template comprehensively. CSTCF utilizes complementary information between different branches to enhance target features and suppress background features. DSTA adopts the adapter concept to adaptively fuse complementary information from multiple branches within the transformer layer, using the RGB modality as a medium. These ingenious fusions of multiple perspectives introduce only less than 0.3\% of the total modal parameters, but they indeed enable an efficient balance between multi-modal and temporal information. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.
☆ Synthetic Lunar Terrain: A Multimodal Open Dataset for Training and Evaluating Neuromorphic Vision Algorithms
Synthetic Lunar Terrain (SLT) is an open dataset collected from an analogue test site for lunar missions, featuring synthetic craters in a high-contrast lighting setup. It includes several side-by-side captures from event-based and conventional RGB cameras, supplemented with a high-resolution 3D laser scan for depth estimation. The event-stream recorded from the neuromorphic vision sensor of the event-based camera is of particular interest as this emerging technology provides several unique advantages, such as high data rates, low energy consumption and resilience towards scenes of high dynamic range. SLT provides a solid foundation to analyse the limits of RGB-cameras and potential advantages or synergies in utilizing neuromorphic visions with the goal of enabling and improving lunar specific applications like rover navigation, landing in cratered environments or similar.
comment: 7 pages, 5 figures, to be published at "International Symposium on Artificial Intelligence, Robotics and Automation in Space, i-SAIRAS, 2024
☆ Contrastive Learning with Synthetic Positives
Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques by utilizing the similarity of multiple instances within the same class. However, its efficacy is constrained as the nearest neighbor algorithm primarily identifies ``easy'' positive pairs, where the representations are already closely located in the embedding space. In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (CLSP) that utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives. Through feature interpolation in the diffusion model sampling process, we generate images with distinct backgrounds yet similar semantic content to the anchor image. These images are considered ``hard'' positives for the anchor image, and when included as supplementary positives in the contrastive loss, they contribute to a performance improvement of over 2\% and 1\% in linear evaluation compared to the previous NNCLR and All4One methods across multiple benchmark datasets such as CIFAR10, achieving state-of-the-art methods. On transfer learning benchmarks, CLSP outperforms existing SSL frameworks on 6 out of 8 downstream datasets. We believe CLSP establishes a valuable baseline for future SSL studies incorporating synthetic data in the training process.
comment: 8 pages, conference
☆ Causal Representation-Based Domain Generalization on Gaze Estimation
The availability of extensive datasets containing gaze information for each subject has significantly enhanced gaze estimation accuracy. However, the discrepancy between domains severely affects a model's performance explicitly trained for a particular domain. In this paper, we propose the Causal Representation-Based Domain Generalization on Gaze Estimation (CauGE) framework designed based on the general principle of causal mechanisms, which is consistent with the domain difference. We employ an adversarial training manner and an additional penalizing term to extract domain-invariant features. After extracting features, we position the attention layer to make features sufficient for inferring the actual gaze. By leveraging these modules, CauGE ensures that the neural networks learn from representations that meet the causal mechanisms' general principles. By this, CauGE generalizes across domains by extracting domain-invariant features, and spurious correlations cannot influence the model. Our method achieves state-of-the-art performance in the domain generalization on gaze estimation benchmark.
☆ HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution
In this paper, we propose HiTSR, a hierarchical transformer model for reference-based image super-resolution, which enhances low-resolution input images by learning matching correspondences from high-resolution reference images. Diverging from existing multi-network, multi-stage approaches, we streamline the architecture and training pipeline by incorporating the double attention block from GAN literature. Processing two visual streams independently, we fuse self-attention and cross-attention blocks through a gating attention strategy. The model integrates a squeeze-and-excitation module to capture global context from the input images, facilitating long-range spatial interactions within window-based attention blocks. Long skip connections between shallow and deep layers further enhance information flow. Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109. Specifically, on the SUN80 dataset, our model achieves PSNR/SSIM values of 30.24/0.821. These results underscore the effectiveness of attention mechanisms in reference-based image super-resolution. The transformer-based model attains state-of-the-art results without the need for purpose-built subnetworks, knowledge distillation, or multi-stage training, emphasizing the potency of attention in meeting reference-based image super-resolution requirements.
comment: arXiv admin note: text overlap with arXiv:2307.08837
☆ Transient Fault Tolerant Semantic Segmentation for Autonomous Driving ECCV 2024
Deep learning models are crucial for autonomous vehicle perception, but their reliability is challenged by algorithmic limitations and hardware faults. We address the latter by examining fault-tolerance in semantic segmentation models. Using established hardware fault models, we evaluate existing hardening techniques both in terms of accuracy and uncertainty and introduce ReLUMax, a novel simple activation function designed to enhance resilience against transient faults. ReLUMax integrates seamlessly into existing architectures without time overhead. Our experiments demonstrate that ReLUMax effectively improves robustness, preserving performance and boosting prediction confidence, thus contributing to the development of reliable autonomous driving systems.
comment: Accepted ECCV 2024 UnCV Workshop - https://github.com/iurada/neutron-segmentation
♻ ☆ Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane SIGGRAPH
We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in one single tri-plane tensor, from which multiple Singed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses tri-planes into a latent space, and then the denoising diffusion process is employed to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement in the room or avatar cloth re-targeting. Our project page is available at: https://wolfball.github.io/frankenstein/.
comment: SIGGRAPH Asia 2024 Conference Paper
♻ ☆ RT-GS2: Real-Time Generalizable Semantic Segmentation for 3D Gaussian Representations of Radiance Fields BMVC 2024
Gaussian Splatting has revolutionized the world of novel view synthesis by achieving high rendering performance in real-time. Recently, studies have focused on enriching these 3D representations with semantic information for downstream tasks. In this paper, we introduce RT-GS2, the first generalizable semantic segmentation method employing Gaussian Splatting. While existing Gaussian Splatting-based approaches rely on scene-specific training, RT-GS2 demonstrates the ability to generalize to unseen scenes. Our method adopts a new approach by first extracting view-independent 3D Gaussian features in a self-supervised manner, followed by a novel View-Dependent / View-Independent (VDVI) feature fusion to enhance semantic consistency over different views. Extensive experimentation on three different datasets showcases RT-GS2's superiority over the state-of-the-art methods in semantic segmentation quality, exemplified by a 8.01% increase in mIoU on the Replica dataset. Moreover, our method achieves real-time performance of 27.03 FPS, marking an astonishing 901 times speedup compared to existing approaches. This work represents a significant advancement in the field by introducing, to the best of our knowledge, the first real-time generalizable semantic segmentation method for 3D Gaussian representations of radiance fields.
comment: Accepted paper at BMVC 2024
♻ ☆ A Permuted Autoregressive Approach to Word-Level Recognition for Urdu Digital Text
This research paper introduces a novel word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text, leveraging transformer-based architectures and attention mechanisms to address the distinct challenges of Urdu script recognition, including its diverse text styles, fonts, and variations. The model employs a permuted autoregressive sequence (PARSeq) architecture, which enhances its performance by enabling context-aware inference and iterative refinement through the training of multiple token permutations. This method allows the model to adeptly manage character reordering and overlapping characters, commonly encountered in Urdu script. Trained on a dataset comprising approximately 160,000 Urdu text images, the model demonstrates a high level of accuracy in capturing the intricacies of Urdu script, achieving a CER of 0.178. Despite ongoing challenges in handling certain text variations, the model exhibits superior accuracy and effectiveness in practical applications. Future work will focus on refining the model through advanced data augmentation techniques and the integration of context-aware language models to further enhance its performance and robustness in Urdu text recognition.
♻ ☆ DeformGS: Scene Flow in Highly Deformable Scenes for Deformable Object Manipulation
Teaching robots to fold, drape, or reposition deformable objects such as cloth will unlock a variety of automation applications. While remarkable progress has been made for rigid object manipulation, manipulating deformable objects poses unique challenges, including frequent occlusions, infinite-dimensional state spaces and complex dynamics. Just as object pose estimation and tracking have aided robots for rigid manipulation, dense 3D tracking (scene flow) of highly deformable objects will enable new applications in robotics while aiding existing approaches, such as imitation learning or creating digital twins with real2sim transfer. We propose DeformGS, an approach to recover scene flow in highly deformable scenes, using simultaneous video captures of a dynamic scene from multiple cameras. DeformGS builds on recent advances in Gaussian splatting, a method that learns the properties of a large number of Gaussians for state-of-the-art and fast novel-view synthesis. DeformGS learns a deformation function to project a set of Gaussians with canonical properties into world space. The deformation function uses a neural-voxel encoding and a multilayer perceptron (MLP) to infer Gaussian position, rotation, and a shadow scalar. We enforce physics-inspired regularization terms based on conservation of momentum and isometry, which leads to trajectories with smaller trajectory errors. We also leverage existing foundation models SAM and XMEM to produce noisy masks, and learn a per-Gaussian mask for better physics-inspired regularization. DeformGS achieves high-quality 3D tracking on highly deformable scenes with shadows and occlusions. In experiments, DeformGS improves 3D tracking by an average of 55.8% compared to the state-of-the-art. With sufficient texture, DeformGS achieves a median tracking error of 3.3 mm on a cloth of 1.5 x 1.5 m in area. Website: https://deformgs.github.io
♻ ☆ OpticalRS-4M: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset
Masked Image Modeling (MIM) has become an essential method for building foundational visual models in remote sensing (RS). However, the limitations in size and diversity of existing RS datasets restrict the ability of MIM methods to learn generalizable representations. Additionally, conventional MIM techniques, which require reconstructing all tokens, introduce unnecessary computational overhead. To address these issues, we present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient MIM approach. We curated a high-quality dataset named OpticalRS-4M by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication. OpticalRS-4M comprises 4 million optical images covering various RS tasks, such as object detection and pixel segmentation. To enhance efficiency, we propose SelectiveMAE, a pre-training method that dynamically encodes and reconstructs semantically rich patch tokens, thereby reducing the inefficiencies of traditional MIM models caused by redundant background pixels in RS images. Extensive experiments demonstrate that OpticalRS-4M significantly improves classification, detection, and segmentation performance, while SelectiveMAE increases training efficiency over 2 times. This highlights the effectiveness and scalability of our pipeline in developing RS foundational models.
♻ ☆ Docling Technical Report
This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.
♻ ☆ Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
The integration of artificial intelligence (AI) in medical diagnostics represents a significant advancement in managing upper gastrointestinal (GI) cancer, a major cause of global cancer mortality. Specifically for gastric cancer (GC), chronic inflammation causes changes in the mucosa such as atrophy, intestinal metaplasia (IM), dysplasia and ultimately cancer. Early detection through endoscopic regular surveillance is essential for better outcomes. Foundation models (FM), which are machine or deep learning models trained on diverse data and applicable to broad use cases, offer a promising solution to enhance the accuracy of endoscopy and its subsequent pathology image analysis. This review explores the recent advancements, applications, and challenges associated with FM in endoscopy and pathology imaging. We started by elucidating the core principles and architectures underlying these models, including their training methodologies and the pivotal role of large-scale data in developing their predictive capabilities. Moreover, this work discusses emerging trends and future research directions, emphasizing the integration of multimodal data, the development of more robust and equitable models, and the potential for real-time diagnostic support. This review aims to provide a roadmap for researchers and practitioners in navigating the complexities of incorporating FM into clinical practice for prevention/management of GC cases, thereby improving patient outcomes.
♻ ☆ DreamPhysics: Learning Physical Properties of Dynamic 3D Gaussians with Video Diffusion Priors
Dynamic 3D interaction has been attracting a lot of attention recently. However, creating such 4D content remains challenging. One solution is to animate 3D scenes with physics-based simulation, which requires manually assigning precise physical properties to the object or the simulated results would become unnatural. Another solution is to learn the deformation of 3D objects with the distillation of video generative models, which, however, tends to produce 3D videos with small and discontinuous motions due to the inappropriate extraction and application of physical prior. In this work, combining the strengths and complementing shortcomings of the above two solutions, we propose to learn the physical properties of a material field with video diffusion priors, and then utilize a physics-based Material-Point-Method (MPM) simulator to generate 4D content with realistic motions. In particular, we propose motion distillation sampling to emphasize video motion information during distillation. Moreover, to facilitate the optimization, we further propose a KAN-based material field with frame boosting. Experimental results demonstrate that our method enjoys more realistic motion than state-of-the-arts. Codes are released at: https://github.com/tyhuang0428/DreamPhysics.
comment: Codes are released at: https://github.com/tyhuang0428/DreamPhysics
♻ ☆ L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection
LiDAR-based vision systems are integral for 3D object detection, which is crucial for autonomous navigation. However, they suffer from performance degradation in adverse weather conditions due to the quality deterioration of LiDAR point clouds. Fusing LiDAR with the weather-robust 4D radar sensor is expected to solve this problem. However, the fusion of LiDAR and 4D radar is challenging because they differ significantly in terms of data quality and the degree of degradation in adverse weather. To address these issues, we introduce L4DR, a weather-robust 3D object detection method that effectively achieves LiDAR and 4D Radar fusion. Our L4DR includes Multi-Modal Encoding (MME) and Foreground-Aware Denoising (FAD) technique to reconcile sensor gaps, which is the first exploration of the complementarity of early fusion between LiDAR and 4D radar. Additionally, we design an Inter-Modal and Intra-Modal ({IM}2 ) parallel feature extraction backbone coupled with a Multi-Scale Gated Fusion (MSGF) module to counteract the varying degrees of sensor degradation under adverse weather conditions. Experimental evaluation on a VoD dataset with simulated fog proves that L4DR is more adaptable to changing weather conditions. It delivers a significant performance increase under different fog levels, improving the 3D mAP by up to 20.0% over the traditional LiDAR-only approach. Moreover, the results on the K-Radar dataset validate the consistent performance improvement of L4DR in real-world adverse weather conditions.
♻ ☆ GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models
Geometric understanding is crucial for navigating and interacting with our environment. While large Vision Language Models (VLMs) demonstrate impressive capabilities, deploying them in real-world scenarios necessitates a comparable geometric understanding in visual perception. In this work, we focus on the geometric comprehension of these models; specifically targeting the depths and heights of objects within a scene. Our observations reveal that, although VLMs excel in basic geometric properties perception such as shape and size, they encounter significant challenges in reasoning about the depth and height of objects. To address this, we introduce GeoMeter, a suite of benchmark datasets encompassing Synthetic 2D, Synthetic 3D, and Real-World scenarios to rigorously evaluate these aspects. We benchmark 17 state-of-the-art VLMs using these datasets and find that they consistently struggle with both depth and height perception. Our key insights include detailed analyses of the shortcomings in depth and height reasoning capabilities of VLMs and the inherent bias present in these models. This study aims to pave the way for the development of VLMs with enhanced geometric understanding, crucial for real-world applications.
♻ ☆ Revisiting 360 Depth Estimation with PanoGabor: A New Fusion Perspective
Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations (\textit{e.g.}, Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces the troublesome distortions. In this work, we propose an oriented distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, thereby extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a linear latitude-aware distortion representation method to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-free features. Considering the orientation sensitivity of the Gabor transform, we introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions. Code can be available upon acceptance.
♻ ☆ Object-Centric Diffusion for Efficient Video Editing ECCV24
Diffusion-based video editing have reached impressive quality and can transform either the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, to fix generation artifacts and further reduce latency by allocating more computations towards foreground edited regions, arguably more important for perceptual quality. We achieve this by two novel proposals: i) Object-Centric Sampling, decoupling the diffusion steps spent on salient or background regions and spending most on the former, and ii) Object-Centric Token Merging, which reduces cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction up to 10x for a comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.
comment: ECCV24
♻ ☆ CaFNet: A Confidence-Driven Framework for Radar Camera Depth Estimation IROS 2024
Depth estimation is critical in autonomous driving for interpreting 3D scenes accurately. Recently, radar-camera depth estimation has become of sufficient interest due to the robustness and low-cost properties of radar. Thus, this paper introduces a two-stage, end-to-end trainable Confidence-aware Fusion Net (CaFNet) for dense depth estimation, combining RGB imagery with sparse and noisy radar point cloud data. The first stage addresses radar-specific challenges, such as ambiguous elevation and noisy measurements, by predicting a radar confidence map and a preliminary coarse depth map. A novel approach is presented for generating the ground truth for the confidence map, which involves associating each radar point with its corresponding object to identify potential projection surfaces. These maps, together with the initial radar input, are processed by a second encoder. For the final depth estimation, we innovate a confidence-aware gated fusion mechanism to integrate radar and image features effectively, thereby enhancing the reliability of the depth map by filtering out radar noise. Our methodology, evaluated on the nuScenes dataset, demonstrates superior performance, improving upon the current leading model by 3.2% in Mean Absolute Error (MAE) and 2.7% in Root Mean Square Error (RMSE). Code: https://github.com/harborsarah/CaFNet
comment: Accepted by IROS 2024
♻ ☆ Addressing the challenges of loop detection in agricultural environments
While visual SLAM systems are well studied and achieve impressive results in indoor and urban settings, natural, outdoor and open-field environments are much less explored and still present relevant research challenges. Visual navigation and local mapping have shown a relatively good performance in open-field environments. However, globally consistent mapping and long-term localization still depend on the robustness of loop detection and closure, for which the literature is scarce. In this work we propose a novel method to pave the way towards robust loop detection in open fields, particularly in agricultural settings, based on local feature search and stereo geometric refinement, with a final stage of relative pose estimation. Our method consistently achieves good loop detections, with a median error of 15cm. We aim to characterize open fields as a novel environment for loop detection, understanding the limitations and problems that arise when dealing with them.
♻ ☆ Fast Fishing: Approximating BAIT for Efficient and Scalable Deep Active Image Classification ECML
Deep active learning (AL) seeks to minimize the annotation costs for training deep neural networks. BAIT, a recently proposed AL strategy based on the Fisher Information, has demonstrated impressive performance across various datasets. However, BAIT's high computational and memory requirements hinder its applicability on large-scale classification tasks, resulting in current research neglecting BAIT in their evaluation. This paper introduces two methods to enhance BAIT's computational efficiency and scalability. Notably, we significantly reduce its time complexity by approximating the Fisher Information. In particular, we adapt the original formulation by i) taking the expectation over the most probable classes, and ii) constructing a binary classification task, leading to an alternative likelihood for gradient computations. Consequently, this allows the efficient use of BAIT on large-scale datasets, including ImageNet. Our unified and comprehensive evaluation across a variety of datasets demonstrates that our approximations achieve strong performance with considerably reduced time complexity. Furthermore, we provide an extensive open-source toolbox that implements recent state-of-the-art AL strategies, available at https://github.com/dhuseljic/dal-toolbox.
comment: Accepted at ECML PKDD 2024
♻ ☆ GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models
Posters play a crucial role in marketing and advertising by enhancing visual communication and brand visibility, making significant contributions to industrial design. With the latest advancements in controllable T2I diffusion models, increasing research has focused on rendering text within synthesized images. Despite improvements in text rendering accuracy, the field of automatic poster generation remains underexplored. In this paper, we propose an automatic poster generation framework with text rendering capabilities leveraging LLMs, utilizing a triple-cross attention mechanism based on alignment learning. This framework aims to create precise poster text within a detailed contextual background. Additionally, the framework supports controllable fonts, adjustable image resolution, and the rendering of posters with descriptions and text in both English and Chinese.Furthermore, we introduce a high-resolution font dataset and a poster dataset with resolutions exceeding 1024 pixels. Our approach leverages the SDXL architecture. Extensive experiments validate our method's capability in generating poster images with complex and contextually rich backgrounds.Codes is available at https://github.com/OPPO-Mente-Lab/GlyphDraw2.
♻ ☆ Large coordinate kernel attention network for lightweight image super-resolution
The multi-scale receptive field and large kernel attention (LKA) module have been shown to significantly improve performance in the lightweight image super-resolution task. However, existing lightweight super-resolution (SR) methods seldom pay attention to designing efficient building block with multi-scale receptive field for local modeling, and their LKA modules face a quadratic increase in computational and memory footprints as the convolutional kernel size increases. To address the first issue, we propose the multi-scale blueprint separable convolutions (MBSConv) as highly efficient building block with multi-scale receptive field, it can focus on the learning for the multi-scale information which is a vital component of discriminative representation. As for the second issue, we revisit the key properties of LKA in which we find that the adjacent direct interaction of local information and long-distance dependencies is crucial to provide remarkable performance. Thus, taking this into account and in order to mitigate the complexity of LKA, we propose a large coordinate kernel attention (LCKA) module which decomposes the 2D convolutional kernels of the depth-wise convolutional layers in LKA into horizontal and vertical 1-D kernels. LCKA enables the adjacent direct interaction of local information and long-distance dependencies not only in the horizontal direction but also in the vertical. Besides, LCKA allows for the direct use of extremely large kernels in the depth-wise convolutional layers to capture more contextual information, which helps to significantly improve the reconstruction performance, and it incurs lower computational complexity and memory footprints. Integrating MBSConv and LCKA, we propose a large coordinate kernel attention network (LCAN).
comment: 13 pages
♻ ☆ Improving Online Source-free Domain Adaptation for Object Detection by Unsupervised Data Acquisition ECCV
Effective object detection in autonomous vehicles is challenged by deployment in diverse and unfamiliar environments. Online Source-Free Domain Adaptation (O-SFDA) offers model adaptation using a stream of unlabeled data from a target domain in an online manner. However, not all captured frames contain information beneficial for adaptation, especially in the presence of redundant data and class imbalance issues. This paper introduces a novel approach to enhance O-SFDA for adaptive object detection through unsupervised data acquisition. Our methodology prioritizes the most informative unlabeled frames for inclusion in the online training process. Empirical evaluation on a real-world dataset reveals that our method outperforms existing state-of-the-art O-SFDA techniques, demonstrating the viability of unsupervised data acquisition for improving the adaptive object detector.
comment: Accepted by ECCV workshop ROAM 2024; 12 pages, 2 figures
♻ ☆ CathAction: A Benchmark for Endovascular Intervention Understanding
Real-time visual feedback from catheterization analysis is crucial for enhancing surgical safety and efficiency during endovascular interventions. However, existing datasets are often limited to specific tasks, small scale, and lack the comprehensive annotations necessary for broader endovascular intervention understanding. To tackle these limitations, we introduce CathAction, a large-scale dataset for catheterization understanding. Our CathAction dataset encompasses approximately 500,000 annotated frames for catheterization action understanding and collision detection, and 25,000 ground truth masks for catheter and guidewire segmentation. For each task, we benchmark recent related works in the field. We further discuss the challenges of endovascular intentions compared to traditional computer vision tasks and point out open research questions. We hope that CathAction will facilitate the development of endovascular intervention understanding methods that can be applied to real-world applications. The dataset is available at https://airvlab.github.io/cathaction/.
comment: 10 pages. Webpage: https://airvlab.github.io/cathaction/
♻ ☆ Human-Free Automated Prompting for Vision-Language Anomaly Detection: Prompt Optimization with Meta-guiding Prompt Scheme
Pre-trained vision-language models (VLMs) are highly adaptable to various downstream tasks through few-shot learning, making prompt-based anomaly detection a promising approach. Traditional methods depend on human-crafted prompts that require prior knowledge of specific anomaly types. Our goal is to develop a human-free prompt-based anomaly detection framework that optimally learns prompts through data-driven methods, eliminating the need for human intervention. The primary challenge in this approach is the lack of anomalous samples during the training phase. Additionally, the Vision Transformer (ViT)-based image encoder in VLMs is not ideal for pixel-wise anomaly segmentation due to a locality feature mismatch between the original image and the output feature map. To tackle the first challenge, we have developed the Object-Attention Anomaly Generation Module (OAGM) to synthesize anomaly samples for training. Furthermore, our Meta-Guiding Prompt-Tuning Scheme (MPTS) iteratively adjusts the gradient-based optimization direction of learnable prompts to avoid overfitting to the synthesized anomalies. For the second challenge, we propose Locality-Aware Attention, which ensures that each local patch feature attends only to nearby patch features, preserving the locality features corresponding to their original locations. This framework allows for the optimal prompt embeddings by searching in the continuous latent space via backpropagation, free from human semantic constraints. Additionally, the modified locality-aware attention improves the precision of pixel-wise anomaly segmentation.
♻ ☆ Many-Worlds Inverse Rendering
Discontinuous visibility changes remain a major bottleneck when optimizing surfaces within a physically-based inverse renderer. Many previous works have proposed sophisticated algorithms and data structures to sample visibility silhouettes more efficiently. Our work presents another solution: instead of differentiating a tentative surface locally, we differentiate a volumetric perturbation of a surface. We refer this as a many-worlds representation because it models a non-interacting superposition of conflicting explanations (worlds) of the input dataset. Each world is optically isolated from others, leading to a new transport law that distinguishes our method from prior work based on exponential random media. The resulting Monte Carlo algorithm is simpler and more efficient than prior methods. We demonstrate that our method promotes rapid convergence, both in terms of the total iteration count and the cost per iteration.
♻ ☆ RoadRunner -- Learning Traversability Estimation for Autonomous Off-road Driving
Autonomous navigation at high speeds in off-road environments necessitates robots to comprehensively understand their surroundings using onboard sensing only. The extreme conditions posed by the off-road setting can cause degraded camera image quality due to poor lighting and motion blur, as well as limited sparse geometric information available from LiDAR sensing when driving at high speeds. In this work, we present RoadRunner, a novel framework capable of predicting terrain traversability and an elevation map directly from camera and LiDAR sensor inputs. RoadRunner enables reliable autonomous navigation, by fusing sensory information, handling of uncertainty, and generation of contextually informed predictions about the geometry and traversability of the terrain while operating at low latency. In contrast to existing methods relying on classifying handcrafted semantic classes and using heuristics to predict traversability costs, our method is trained end-to-end in a self-supervised fashion. The RoadRunner network architecture builds upon popular sensor fusion network architectures from the autonomous driving domain, which embed LiDAR and camera information into a common Bird's Eye View perspective. Training is enabled by utilizing an existing traversability estimation stack to generate training data in hindsight in a scalable manner from real-world off-road driving datasets. Furthermore, RoadRunner improves the system latency by a factor of roughly 4, from 500 ms to 140 ms, while improving the accuracy for traversability costs and elevation map predictions. We demonstrate the effectiveness of RoadRunner in enabling safe and reliable off-road navigation at high speeds in multiple real-world driving scenarios through unstructured desert environments.
comment: accepted for IEEE Transactions on Field Robotics (T-FR)
♻ ☆ Evidential Deep Partial Multi-View Classification With Discount Fusion
Incomplete multi-view data classification poses significant challenges due to the common issue of missing views in real-world scenarios. Despite advancements, existing methods often fail to provide reliable predictions, largely due to the uncertainty of missing views and the inconsistent quality of imputed data. To tackle these problems, we propose a novel framework called Evidential Deep Partial Multi-View Classification (EDP-MVC). Initially, we use K-means imputation to address missing views, creating a complete set of multi-view data. However, the potential conflicts and uncertainties within this imputed data can affect the reliability of downstream inferences. To manage this, we introduce a Conflict-Aware Evidential Fusion Network (CAEFN), which dynamically adjusts based on the reliability of the evidence, ensuring trustworthy discount fusion and producing reliable inference outcomes. Comprehensive experiments on various benchmark datasets reveal EDP-MVC not only matches but often surpasses the performance of state-of-the-art methods.
comment: Ongoing work. 13 pages, 3 figures, 6 tables
♻ ☆ Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models
Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel \textbf{C}ounter\textbf{F}actual \textbf{M}ulti\textbf{M}odal reasoning benchmark, abbreviated as \textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled and GPT-generated counterfactual questions, to evaluate MLLM's counterfactual reasoning capabilities across diverse aspects. Through experiments, interestingly, we find that existing MLLMs prefer to believe what they see, but ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement in existing MLLMs toward approaching human-level intelligence. On the other hand, through boosting MLLMs performances on our CFMM in the future, potential avenues toward developing MLLMs with advanced intelligence can be explored.
♻ ☆ Early Explorations of Lightweight Models for Wound Segmentation on Mobile Devices
The aging population poses numerous challenges to healthcare, including the increase in chronic wounds in the elderly. The current approach to wound assessment by therapists based on photographic documentation is subjective, highlighting the need for computer-aided wound recognition from smartphone photos. This offers objective and convenient therapy monitoring, while being accessible to patients from their home at any time. However, despite research in mobile image segmentation, there is a lack of focus on mobile wound segmentation. To address this gap, we conduct initial research on three lightweight architectures to investigate their suitability for smartphone-based wound segmentation. Using public datasets and UNet as a baseline, our results are promising, with both ENet and TopFormer, as well as the larger UNeXt variant, showing comparable performance to UNet. Furthermore, we deploy the models into a smartphone app for visual assessment of live segmentation, where results demonstrate the effectiveness of TopFormer in distinguishing wounds from wound-coloured objects. While our study highlights the potential of transformer models for mobile wound segmentation, future work should aim to further improve the mask contours.
comment: Extended version of our paper that was published in the "47th German Conference on Artificial Intelligence (KI 2024)"
♻ ☆ Deep Convolutional Framelet Denoising for Panoramic by Mixed Wavelet Integration
Enhancing quality and removing noise during preprocessing is one of the most critical steps in image processing. X-ray images are created by photons colliding with atoms and the variation in scattered noise absorption. This noise leads to a deterioration in the graph's medical quality and, at times, results in repetition, thereby increasing the patient's effective dose. One of the most critical challenges in this area has consistently been lowering the image noise. Techniques like BM3d, low-pass filters, and Autoencoder have taken this step. Owing to their structural design and high rate of repetition, neural networks employing diverse architectures have, over the past decade, achieved noise reduction with satisfactory outcomes, surpassing the traditional BM3D and low-pass filters. The combination of the Hankel matrix with neural networks represents one of these configurations. The Hankel matrix aims to identify a local circle by separating individual values into local and non-local components, utilizing a non-local matrix. A non-local matrix can be created using the wave or DCT. This paper suggests integrating the waveform with the Daubechies (D4) wavelet due to its higher energy concentration and employs the u-Net neural network architecture, which incorporates the waveform exclusively at each stage. The outcomes were evaluated using the PSNR and SSIM criteria, and the outcomes were verified by using various waves. The effectiveness of a one-wave network has increased from 0.5% to 1.2%, according to studies done on other datasets
♻ ☆ TSAR-MVS: Textureless-aware Segmentation and Correlative Refinement Guided Multi-View Stereo
The reconstruction of textureless areas has long been a challenging problem in MVS due to lack of reliable pixel correspondences between images. In this paper, we propose the Textureless-aware Segmentation And Correlative Refinement guided Multi-View Stereo (TSAR-MVS), a novel method that effectively tackles challenges posed by textureless areas in 3D reconstruction through filtering, refinement and segmentation. First, we implement the joint hypothesis filtering, a technique that merges a confidence estimator with a disparity discontinuity detector to eliminate incorrect depth estimations. Second, to spread the pixels with confident depth, we introduce an iterative correlation refinement strategy that leverages RANSAC to generate 3D planes based on superpixels, succeeded by a weighted median filter for broadening the influence of accurately determined pixels. Finally, we present a textureless-aware segmentation method that leverages edge detection and line detection for accurately identify large textureless regions for further depth completion. Experiments on ETH3D, Tanks & Temples and Strecha datasets demonstrate the superior performance and strong generalization capability of our proposed method.
♻ ☆ MSP-MVS: Multi-granularity Segmentation Prior Guided Multi-View Stereo
Reconstructing textureless areas in MVS poses challenges due to the absence of reliable pixel correspondences within fixed patch. Although certain methods employ patch deformation to expand the receptive field, their patches mistakenly skip depth edges to calculate areas with depth discontinuity, thereby causing ambiguity. Consequently, we introduce Multi-granularity Segmentation Prior Multi-View Stereo (MSP-MVS). Specifically, we first propose multi-granularity segmentation prior by integrating multi-granularity depth edges to restrict patch deformation within homogeneous areas. Moreover, we present anchor equidistribution that bring deformed patches with more uniformly distributed anchors to ensure an adequate coverage of their own homogeneous areas. Furthermore, we introduce iterative local search optimization to represent larger patch with sparse representative candidates, significantly boosting the expressive capacity for each patch. The state-of-the-art results on ETH3D and Tanks & Temples benchmarks demonstrate the effectiveness and robust generalization ability of our proposed method.
comment: arXiv admin note: text overlap with arXiv:2308.09990
♻ ☆ Tailoring Adversarial Attacks on Deep Neural Networks for Targeted Class Manipulation Using DeepFool Algorithm
The susceptibility of deep neural networks (DNNs) to adversarial attacks undermines their reliability across numerous applications, underscoring the necessity for an in-depth exploration of these vulnerabilities and the formulation of robust defense strategies. The DeepFool algorithm by Moosavi-Dezfooli et al. (2016) represents a pivotal step in identifying minimal perturbations required to induce misclassification of input images. Nonetheless, its generic methodology falls short in scenarios necessitating targeted interventions. Additionally, previous research studies have predominantly concentrated on the success rate of attacks without adequately addressing the consequential distortion of images, the maintenance of image quality, or the confidence threshold required for misclassification. To bridge these gaps, we introduce the Enhanced Targeted DeepFool (ET DeepFool) algorithm, an evolution of DeepFool that not only facilitates the specification of desired misclassification targets but also incorporates a configurable minimum confidence score. Our empirical investigations demonstrate the superiority of this refined approach in maintaining the integrity of images and minimizing perturbations across a variety of DNN architectures. Unlike previous iterations, such as the Targeted DeepFool by Gajjar et al. (2022), our method grants unparalleled control over the perturbation process, enabling precise manipulation of model responses. Preliminary outcomes reveal that certain models, including AlexNet and the advanced Vision Transformer, display commendable robustness to such manipulations. This discovery of varying levels of model robustness, as unveiled through our confidence level adjustments, could have far-reaching implications for the field of image recognition. Our code will be made public upon acceptance of the paper.
comment: 18 pages, 5 figures
♻ ☆ Zero-Shot Multi-Object Scene Completion ECCV 2024
We present a 3D scene completion method that recovers the complete geometry of multiple unseen objects in complex scenes from a single RGB-D image. Despite notable advancements in single-object 3D shape completion, high-quality reconstructions in highly cluttered real-world multi-object scenes remains a challenge. To address this issue, we propose OctMAE, an architecture that leverages an Octree U-Net and a latent 3D MAE to achieve high-quality and near real-time multi-object scene completion through both local and global geometric reasoning. Because a naive 3D MAE can be computationally intractable and memory intensive even in the latent space, we introduce a novel occlusion masking strategy and adopt 3D rotary embeddings, which significantly improves the runtime and scene completion quality. To generalize to a wide range of objects in diverse scenes, we create a large-scale photorealistic dataset, featuring a diverse set of 12K 3D object models from the Objaverse dataset which are rendered in multi-object scenes with physics-based positioning. Our method outperforms the current state-of-the-art on both synthetic and real-world datasets and demonstrates a strong zero-shot capability.
comment: Published at ECCV 2024, Webpage: https://sh8.io/#/oct_mae
♻ ☆ Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying "red cube" by reasoning over the constituents "red" and "cube". In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating "cube behind sphere" from "sphere behind cube"). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets - single-object, two-object, and relational - designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.
comment: Lewis and Nayak contributed equally
♻ ☆ 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance
In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance, an alternative approach that a 3D model predicts dense-embedding for each point which is co-embedded with both the aligned image and text spaces from the 2D vision-language model. Specifically, our method exploits the superior generalization ability of the 2D vision-language models and proposes the Embeddings Soft-Guidance Stage to utilize it to implicitly align 3D embeddings and text embeddings. Moreover, we introduce the Embeddings Specialization Stage to purify the feature representation with the help of a given scene-level label, specifying a better feature supervised by the corresponding text embedding. Thus, the 3D model is able to gain informative supervisions both from the image embedding and text embedding, leading to competitive segmentation performances. To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels. Moreover, with extensive quantitative and qualitative experiments, we present that our 3DSS-VLG is able not only to achieve the state-of-the-art performance on both S3DIS and ScanNet datasets, but also to maintain strong generalization capability.
♻ ☆ Prompt-Agnostic Adversarial Perturbation for Customized Diffusion Models
Diffusion models have revolutionized customized text-to-image generation, allowing for efficient synthesis of photos from personal data with textual descriptions. However, these advancements bring forth risks including privacy breaches and unauthorized replication of artworks. Previous researches primarily center around using prompt-specific methods to generate adversarial examples to protect personal images, yet the effectiveness of existing methods is hindered by constrained adaptability to different prompts. In this paper, we introduce a Prompt-Agnostic Adversarial Perturbation (PAP) method for customized diffusion models. PAP first models the prompt distribution using a Laplace Approximation, and then produces prompt-agnostic perturbations by maximizing a disturbance expectation based on the modeled distribution. This approach effectively tackles the prompt-agnostic attacks, leading to improved defense stability. Extensive experiments in face privacy and artistic style protection, demonstrate the superior generalization of our method in comparison to existing techniques.
comment: The experiments are insufficient and need to be completed
♻ ☆ Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation CVPR 2024
We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment influences the planning in an end-to-end manner instead of resorting to non-differentiable post-processing. This method achieves the $1^{st}$ place in the Navsim challenge, demonstrating significant improvements in generalization across diverse driving environments and conditions. More details by visiting \url{https://github.com/NVlabs/Hydra-MDP}.
comment: The 1st place solution of End-to-end Driving at Scale at the CVPR 2024 Autonomous Grand Challenge
♻ ☆ VRSO: Visual-Centric Reconstruction for Static Object Annotation IROS
As a part of the perception results of intelligent driving systems, static object detection (SOD) in 3D space provides crucial cues for driving environment understanding. With the rapid deployment of deep neural networks for SOD tasks, the demand for high-quality training samples soars. The traditional, also reliable, way is manual labelling over the dense LiDAR point clouds and reference images. Though most public driving datasets adopt this strategy to provide SOD ground truth (GT), it is still expensive and time-consuming in practice. This paper introduces VRSO, a visual-centric approach for static object annotation. Experiments on the Waymo Open Dataset show that the mean reprojection error from VRSO annotation is only 2.6 pixels, around four times lower than the Waymo Open Dataset labels (10.6 pixels). VRSO is distinguished in low cost, high efficiency, and high quality: (1) It recovers static objects in 3D space with only camera images as input, and (2) manual annotation is barely involved since GT for SOD tasks is generated based on an automatic reconstruction and annotation pipeline.
comment: Accepted at 2024 IEEE International Conference on Intelligent Robots and Systems (IROS)
♻ ☆ An Asynchronous Linear Filter Architecture for Hybrid Event-Frame Cameras
Event cameras are ideally suited to capture High Dynamic Range (HDR) visual information without blur but provide poor imaging capability for static or slowly varying scenes. Conversely, conventional image sensors measure absolute intensity of slowly changing scenes effectively but do poorly on HDR or quickly changing scenes. In this paper, we present an asynchronous linear filter architecture, fusing event and frame camera data, for HDR video reconstruction and spatial convolution that exploits the advantages of both sensor modalities. The key idea is the introduction of a state that directly encodes the integrated or convolved image information and that is updated asynchronously as each event or each frame arrives from the camera. The state can be read-off as-often-as and whenever required to feed into subsequent vision modules for real-time robotic systems. Our experimental results are evaluated on both publicly available datasets with challenging lighting conditions and fast motions, along with a new dataset with HDR reference that we provide. The proposed AKF pipeline outperforms other state-of-the-art methods in both absolute intensity error (69.4% reduction) and image similarity indexes (average 35.5% improvement). We also demonstrate the integration of image convolution with linear spatial kernels Gaussian, Sobel, and Laplacian as an application of our architecture.
comment: 17 pages, 10 figures. Date of Publication: 04 September 2023
♻ ☆ LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding.
♻ ☆ Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks
Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: https://github.com/Visual-AI/Dissect-OOD-OSR
comment: Accepted to IJCV, preprint version; v2: add supplementary
♻ ☆ MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4
Vision-Language Models (VLMs) have recently seen significant advancements through integrating with Large Language Models (LLMs). The VLMs, which process image and text modalities simultaneously, have demonstrated the ability to learn and understand the interaction between images and texts across various multi-modal tasks. Reverse designing, which could be defined as a complex vision-language task, aims to predict the edits and their parameters, given a source image, an edited version, and an optional high-level textual edit description. This task requires VLMs to comprehend the interplay between the source image, the edited version, and the optional textual context simultaneously, going beyond traditional vision-language tasks. In this paper, we extend and fine-tune MiniGPT-4 for the reverse designing task. Our experiments demonstrate the extensibility of off-the-shelf VLMs, specifically MiniGPT-4, for more complex tasks such as reverse designing. Code is available at this \href{https://github.com/VahidAz/MiniGPT-Reverse-Designing}
comment: 8 pages, 7 figures
♻ ☆ Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment
Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.
♻ ☆ DeepSpeak Dataset v1.0
We describe a large-scale dataset--DeepSpeak--of real and deepfake footage of people talking and gesturing in front of their webcams. The real videos in this first version of the dataset consist of 17 hours of footage from 220 diverse individuals. Constituting more than 26 hours of footage, the fake videos consist of a range of different state-of-the-art face-swap and lip-sync deepfakes with natural and AI-generated voices. We expect to release future versions of this dataset with different and updated deepfake technologies. This dataset is made freely available for research and non-commercial uses; requests for commercial use will be considered.
♻ ☆ Visual Multi-Object Tracking with Re-Identification and Occlusion Handling using Labeled Random Finite Sets
This paper proposes an online visual multi-object tracking (MOT) algorithm that resolves object appearance-reappearance and occlusion. Our solution is based on the labeled random finite set (LRFS) filtering approach, which in principle, addresses disappearance, appearance, reappearance, and occlusion via a single Bayesian recursion. However, in practice, existing numerical approximations cause reappearing objects to be initialized as new tracks, especially after long periods of being undetected. In occlusion handling, the filter's efficacy is dictated by trade-offs between the sophistication of the occlusion model and computational demand. Our contribution is a novel modeling method that exploits object features to address reappearing objects whilst maintaining a linear complexity in the number of detections. Moreover, to improve the filter's occlusion handling, we propose a fuzzy detection model that takes into consideration the overlapping areas between tracks and their sizes. We also develop a fast version of the filter to further reduce the computational time. The source code is publicly available at https://github.com/linh-gist/mv-glmb-ab.
♻ ☆ Criticality Leveraged Adversarial Training (CLAT) for Boosted Performance via Parameter Efficiency
Adversarial training enhances neural network robustness but suffers from a tendency to overfit and increased generalization errors on clean data. This work introduces CLAT, an innovative approach that mitigates adversarial overfitting by introducing parameter efficiency into the adversarial training process, improving both clean accuracy and adversarial robustness. Instead of tuning the entire model, CLAT identifies and fine-tunes robustness-critical layers - those predominantly learning non-robust features - while freezing the remaining model to enhance robustness. It employs dynamic critical layer selection to adapt to changes in layer criticality throughout the fine-tuning process. Empirically, CLAT can be applied on top of existing adversarial training methods, significantly reduces the number of trainable parameters by approximately 95%, and achieves more than a 2% improvement in adversarial robustness compared to baseline methods.
comment: 9 pages + appendix/ additional experiments
♻ ☆ Prompt Generation Networks for Input-Space Adaptation of Frozen Vision Transformers BMVC2024
With the introduction of the transformer architecture in computer vision, increasing model scale has been demonstrated as a clear path to achieving performance and robustness gains. However, with model parameter counts reaching the billions, classical finetuning approaches are becoming increasingly limiting and even unfeasible when models become hosted as inference APIs, as in NLP. Visual input-prompt learning, an adaptation technique in which additional inputs in visual (RGB) space are learned, has emerged as a potential solution for adapting frozen and cloud-hosted models, requiring neither access to the forward pass, nor post-processing. Yet so far, these constraints have deteriorated adaptation performances significantly. To this end, we propose the Prompt Generation Network (PGN) that generates a different prompt for every data point, which is then used to adapt a frozen pretrained vision model to a target task. We show that the PGN effectively adapts pretrained models to various new datasets: It surpasses previous methods by a large margin on 12/12 datasets and even outperforms full-finetuning on 5/12, while requiring 100x fewer parameters. Lastly, we introduce the "prompt inversion" trick, with which PGNs can be efficiently trained in a latent space but deployed in RGB input space for inference.
comment: Accepted by BMVC2024. Codebase: https://github.com/jochemloedeman/PGN
♻ ☆ PoCo: Point Context Cluster for RGBD Indoor Place Recognition
We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database. The task presents inherent challenges attributed to the constrained field of view and limited range of perception sensors. We propose a new network architecture, which generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from the noisy point clouds through end-to-end learning. Moreover, we develop the architecture by integrating both color and geometric modalities into the point features to enhance the global descriptor representation. We conducted evaluations on public datasets ScanNet-PR and ARKit with 807 and 5047 scenarios, respectively. PoCo achieves SOTA performance: on ScanNet-PR, we achieve R@1 of 64.63%, a 5.7% improvement from the best-published result CGis (61.12%); on Arkit, we achieve R@1 of 45.12%, a 13.3% improvement from the best-published result CGis (39.82%). In addition, PoCo shows higher efficiency than CGis in inference time (1.75X-faster), and we demonstrate the effectiveness of PoCo in recognizing places within a real-world laboratory environment.
♻ ☆ Mix-Domain Contrastive Learning for Unpaired H&E-to-IHC Stain Translation
H&E-to-IHC stain translation techniques offer a promising solution for precise cancer diagnosis, especially in low-resource regions where there is a shortage of health professionals and limited access to expensive equipment. Considering the pixel-level misalignment of H&E-IHC image pairs, current research explores the pathological consistency between patches from the same positions of the image pair. However, most of them overemphasize the correspondence between domains or patches, overlooking the side information provided by the non-corresponding objects. In this paper, we propose a Mix-Domain Contrastive Learning (MDCL) method to leverage the supervision information in unpaired H&E-to-IHC stain translation. Specifically, the proposed MDCL method aggregates the inter-domain and intra-domain pathology information by estimating the correlation between the anchor patch and all the patches from the matching images, encouraging the network to learn additional contrastive knowledge from mixed domains. With the mix-domain pathology information aggregation, MDCL enhances the pathological consistency between the corresponding patches and the component discrepancy of the patches from the different positions of the generated IHC image. Extensive experiments on two H&E-to-IHC stain translation datasets, namely MIST and BCI, demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics.
♻ ☆ FRACTAL: An Ultra-Large-Scale Aerial Lidar Dataset for 3D Semantic Segmentation of Diverse Landscapes
Mapping agencies are increasingly adopting Aerial Lidar Scanning (ALS) as a new tool to map buildings and other above-ground structures. Processing ALS data at scale requires efficient point classification methods that perform well over highly diverse territories. Large annotated Lidar datasets are needed to evaluate these classification methods, however, current Lidar benchmarks have restricted scope and often cover a single urban area. To bridge this data gap, we introduce the FRench ALS Clouds from TArgeted Landscapes (FRACTAL) dataset: an ultra-large-scale aerial Lidar dataset made of 100,000 dense point clouds with high quality labels for 7 semantic classes and spanning 250 km$^2$. FRACTAL achieves high spatial and semantic diversity by explicitly sampling rare classes and challenging landscapes from five different regions of France. We describe the data collection, annotation, and curation process of the dataset. We provide baseline semantic segmentation results using a state of the art 3D point cloud classification model. FRACTAL aims to support the development of 3D deep learning approaches for large-scale land monitoring.
comment: 9 (body) + 2 (bibliography) + 8 (appendices) pages | Dataset is available at https://huggingface.co/datasets/IGNF/FRACTAL | Trained model is available at https://huggingface.co/IGNF/FRACTAL-LidarHD_7cl_randlanet | Deep learning code repository is on Gihtub at https://github.com/IGNF/myria3d | Data engineering code repository is on Github at https://github.com/IGNF/pacasam
♻ ☆ Matryoshka Diffusion Models ICLR2024
Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion Models(MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images. Our code is released at https://github.com/apple/ml-mdm
comment: Accepted by ICLR2024
♻ ☆ Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion BMVC 2024
In recent years, there has been significant interest in creating 3D avatars and motions, driven by their diverse applications in areas like film-making, video games, AR/VR, and human-robot interaction. However, current efforts primarily concentrate on either generating the 3D avatar mesh alone or producing motion sequences, with integrating these two aspects proving to be a persistent challenge. Additionally, while avatar and motion generation predominantly target humans, extending these techniques to animals remains a significant challenge due to inadequate training data and methods. To bridge these gaps, our paper presents three key contributions. Firstly, we proposed a novel agent-based approach named Motion Avatar, which allows for the automatic generation of high-quality customizable human and animal avatars with motions through text queries. The method significantly advanced the progress in dynamic 3D character generation. Secondly, we introduced a LLM planner that coordinates both motion and avatar generation, which transforms a discriminative planning into a customizable Q&A fashion. Lastly, we presented an animal motion dataset named Zoo-300K, comprising approximately 300,000 text-motion pairs across 65 animal categories and its building pipeline ZooGen, which serves as a valuable resource for the community. See project website https://steve-zeyu-zhang.github.io/MotionAvatar/
comment: Accepted to BMVC 2024
Information Retrieval
☆ rerankers: A Lightweight Python Library to Unify Ranking Methods
This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches. Re-ranking is an integral component of many retrieval pipelines; however, there exist numerous approaches to it, relying on different implementation methods. \texttt{rerankers} unifies these methods into a single user-friendly interface, allowing practitioners and researchers alike to explore different methods while only changing a single line of Python code. Moreover ,rerankers ensures that its implementations are done with the fewest dependencies possible, and re-uses the original implementation whenever possible, guaranteeing that our simplified interface results in no performance degradation compared to more complex ones. The full source code and list of supported models are updated regularly and available at https://github.com/answerdotai/rerankers.
☆ Not All Videos Become Outdated: Short-Video Recommendation by Learning to Deconfound Release Interval Bias
Short-video recommender systems often exhibit a biased preference to recently released videos. However, not all videos become outdated; certain classic videos can still attract user's attention. Such bias along temporal dimension can be further aggravated by the matching model between users and videos, because the model learns from preexisting interactions. From real data, we observe that different videos have varying sensitivities to recency in attracting users' attention. Our analysis, based on a causal graph modeling short-video recommendation, suggests that the release interval serves as a confounder, establishing a backdoor path between users and videos. To address this confounding effect, we propose a model-agnostic causal architecture called Learning to Deconfound the Release Interval Bias (LDRI). LDRI enables jointly learning of the matching model and the video recency sensitivity perceptron. In the inference stage, we apply a backdoor adjustment, effectively blocking the backdoor path by intervening on each video. Extensive experiments on two benchmarks demonstrate that LDRI consistently outperforms backbone models and exhibits superior performance against state-of-the-art models. Additional comprehensive analyses confirm the deconfounding capability of LDRI.
☆ Metadata practices for simulation workflows
Computer simulations are an essential pillar of knowledge generation in science. Understanding, reproducing, and exploring the results of simulations relies on tracking and organizing metadata describing numerical experiments. However, the models used to understand real-world systems, and the computational machinery required to simulate them, are typically complex, and produce large amounts of heterogeneous metadata. Here, we present general practices for acquiring and handling metadata that are agnostic to software and hardware, and highly flexible for the user. These consist of two steps: 1) recording and storing raw metadata, and 2) selecting and structuring metadata. As a proof of concept, we develop the Archivist, a Python tool to help with the second step, and use it to apply our practices to distinct high-performance computing use cases from neuroscience and hydrology. Our practices and the Archivist can readily be applied to existing workflows without the need for substantial restructuring. They support sustainable numerical workflows, facilitating reproducibility and data reuse in generic simulation-based research.
comment: 19 pages, 5 figures
☆ Efficient Multi-task Prompt Tuning for Recommendation
With the expansion of business scenarios, real recommender systems are facing challenges in dealing with the constantly emerging new tasks in multi-task learning frameworks. In this paper, we attempt to improve the generalization ability of multi-task recommendations when dealing with new tasks. We find that joint training will enhance the performance of the new task but always negatively impact existing tasks in most multi-task learning methods. Besides, such a re-training mechanism with new tasks increases the training costs, limiting the generalization ability of multi-task recommendation models. Based on this consideration, we aim to design a suitable sharing mechanism among different tasks while maintaining joint optimization efficiency in new task learning. A novel two-stage prompt-tuning MTL framework (MPT-Rec) is proposed to address task irrelevance and training efficiency problems in multi-task recommender systems. Specifically, we disentangle the task-specific and task-sharing information in the multi-task pre-training stage, then use task-aware prompts to transfer knowledge from other tasks to the new task effectively. By freezing parameters in the pre-training tasks, MPT-Rec solves the negative impacts that may be brought by the new task and greatly reduces the training costs. Extensive experiments on three real-world datasets show the effectiveness of our proposed multi-task learning framework. MPT-Rec achieves the best performance compared to the SOTA multi-task learning method. Besides, it maintains comparable model performance but vastly improves the training efficiency (i.e., with up to 10% parameters in the full training way) in the new task learning.
☆ Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis
How can balance be quantified in game settings? This question is crucial for game designers, especially in player-versus-player (PvP) games, where analyzing the strength relations among predefined team compositions-such as hero combinations in multiplayer online battle arena (MOBA) games or decks in card games-is essential for enhancing gameplay and achieving balance. We have developed two advanced measures that extend beyond the simplistic win rate to quantify balance in zero-sum competitive scenarios. These measures are derived from win value estimations, which employ strength rating approximations via the Bradley-Terry model and counter relationship approximations via vector quantization, significantly reducing the computational complexity associated with traditional win value estimations. Throughout the learning process of these models, we identify useful categories of compositions and pinpoint their counter relationships, aligning with the experiences of human players without requiring specific game knowledge. Our methodology hinges on a simple technique to enhance codebook utilization in discrete representation with a deterministic vector quantization process for an extremely small state space. Our framework has been validated in popular online games, including Age of Empires II, Hearthstone, Brawl Stars, and League of Legends. The accuracy of the observed strength relations in these games is comparable to traditional pairwise win value predictions, while also offering a more manageable complexity for analysis. Ultimately, our findings contribute to a deeper understanding of PvP game dynamics and present a methodology that significantly improves game balance evaluation and design.
comment: TMLR 09/2024 https://openreview.net/forum?id=2D36otXvBE
☆ Understanding the User: An Intent-Based Ranking Dataset
As information retrieval systems continue to evolve, accurate evaluation and benchmarking of these systems become pivotal. Web search datasets, such as MS MARCO, primarily provide short keyword queries without accompanying intent or descriptions, posing a challenge in comprehending the underlying information need. This paper proposes an approach to augmenting such datasets to annotate informative query descriptions, with a focus on two prominent benchmark datasets: TREC-DL-21 and TREC-DL-22. Our methodology involves utilizing state-of-the-art LLMs to analyze and comprehend the implicit intent within individual queries from benchmark datasets. By extracting key semantic elements, we construct detailed and contextually rich descriptions for these queries. To validate the generated query descriptions, we employ crowdsourcing as a reliable means of obtaining diverse human perspectives on the accuracy and informativeness of the descriptions. This information can be used as an evaluation set for tasks such as ranking, query rewriting, or others.
☆ Evaluation of Table Representations to Answer Questions from Tables in Documents : A Case Study using 3GPP Specifications
With the ubiquitous use of document corpora for question answering, one important aspect which is especially relevant for technical documents is the ability to extract information from tables which are interspersed with text. The major challenge in this is that unlike free-flow text or isolated set of tables, the representation of a table in terms of what is a relevant chunk is not obvious. We conduct a series of experiments examining various representations of tabular data interspersed with text to understand the relative benefits of different representations. We choose a corpus of $3^{rd}$ Generation Partnership Project (3GPP) documents since they are heavily interspersed with tables. We create expert curated dataset of question answers to evaluate our approach. We conclude that row level representations with corresponding table header information being included in every cell improves the performance of the retrieval, thus leveraging the structural information present in the tabular data.
comment: 10 pages, 4 figures, 2 tables
☆ Facilitating phenotyping from clinical texts: the medkit library
Phenotyping consists in applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically out of a collection of Electronic Health Records (EHRs). Because a lot of the clinical information of EHRs are lying in texts, phenotyping from text takes an important role in studies that rely on the secondary use of EHRs. However, the heterogeneity and highly specialized aspect of both the content and form of clinical texts makes this task particularly tedious, and is the source of time and cost constraints in observational studies. To facilitate the development, evaluation and reproductibility of phenotyping pipelines, we developed an open-source Python library named medkit. It enables composing data processing pipelines made of easy-to-reuse software bricks, named medkit operations. In addition to the core of the library, we share the operations and pipelines we already developed and invite the phenotyping community for their reuse and enrichment. medkit is available at https://github.com/medkit-lib/medkit
♻ ☆ Beyond One-Size-Fits-All: Multi-Domain, Multi-Task Framework for Embedding Model Selection
This position paper proposes a systematic approach towards developing a framework to help select the most effective embedding models for natural language processing (NLP) tasks, addressing the challenge posed by the proliferation of both proprietary and open-source encoder models.
comment: It was an initial idea - we plan to work on a detailed version
♻ ☆ SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval
Large-scale test collections play a crucial role in Information Retrieval (IR) research. However, according to the Cranfield paradigm and the research into publicly available datasets, the existing information retrieval research studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments - a time-intensive and expensive process. Recent studies have shown the strong capability of Large Language Models (LLMs) in producing reliable relevance judgments with human accuracy but at a greatly reduced cost. In this paper, to address the missing large-scale ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL) test collection via additional language model synthetic labels to enable researchers to test and evaluate their search systems at a large scale. Specifically, such a test collection includes more than 1,900 test queries from the previous years of tracks. We compare system evaluation with past human labels from past years and find that our synthetically created large-scale test collection can lead to highly correlated system rankings.
comment: 9 pages, resource paper
♻ ☆ The Importance of Cognitive Biases in the Recommendation Ecosystem
Cognitive biases have been studied in psychology, sociology, and behavioral economics for decades. Traditionally, they have been considered a negative human trait that leads to inferior decision-making, reinforcement of stereotypes, or can be exploited to manipulate consumers, respectively. We argue that cognitive biases also manifest in different parts of the recommendation ecosystem and at different stages of the recommendation process. More importantly, we contest this traditional detrimental perspective on cognitive biases and claim that certain cognitive biases can be beneficial when accounted for by recommender systems. Concretely, we provide empirical evidence that biases such as feature-positive effect, Ikea effect, and cultural homophily can be observed in various components of the recommendation pipeline, including input data (such as ratings or side information), recommendation algorithm or model (and consequently recommended items), and user interactions with the system. In three small experiments covering recruitment and entertainment domains, we study the pervasiveness of the aforementioned biases. We ultimately advocate for a prejudice-free consideration of cognitive biases to improve user and item models as well as recommendation algorithms.
♻ ☆ Perceptual Similarity for Measuring Decision-Making Style and Policy Diversity in Games
Defining and measuring decision-making styles, also known as playstyles, is crucial in gaming, where these styles reflect a broad spectrum of individuality and diversity. However, finding a universally applicable measure for these styles poses a challenge. Building on Playstyle Distance, the first unsupervised metric to measure playstyle similarity based on game screens and raw actions, we introduce three enhancements to increase accuracy: multiscale analysis with varied state granularity, a perceptual kernel rooted in psychology, and the utilization of the intersection-over-union method for efficient evaluation. These innovations not only advance measurement precision but also offer insights into human cognition of similarity. Across two racing games and seven Atari games, our techniques significantly improve the precision of zero-shot playstyle classification, achieving an accuracy exceeding 90 percent with fewer than 512 observation-action pairs, which is less than half an episode of these games. Furthermore, our experiments with 2048 and Go demonstrate the potential of discrete playstyle measures in puzzle and board games. We also develop an algorithm for assessing decision-making diversity using these measures. Our findings improve the measurement of end-to-end game analysis and the evolution of artificial intelligence for diverse playstyles.
comment: TMLR 08/2024 https://openreview.net/forum?id=30C9AWBW49
♻ ☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that the reducing the embedding dimensionality from 128 to 64 has insignificant impact on the model's retrieval performance and cut storage requirements by up to 50%. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks,
Machine Learning
☆ SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection
Synthesizing the voices of unseen speakers is a persisting challenge in multi-speaker text-to-speech (TTS). Most multi-speaker TTS models rely on modeling speaker characteristics through speaker conditioning during training. Modeling unseen speaker attributes through this approach has necessitated an increase in model complexity, which makes it challenging to reproduce results and improve upon them. We design a simple alternative to this. We propose SelectTTS, a novel method to select the appropriate frames from the target speaker and decode using frame-level self-supervised learning (SSL) features. We show that this approach can effectively capture speaker characteristics for unseen speakers, and achieves comparable results to other multi-speaker TTS frameworks in both objective and subjective metrics. With SelectTTS, we show that frame selection from the target speaker's speech is a direct way to achieve generalization in unseen speakers with low model complexity. We achieve better speaker similarity performance than SOTA baselines XTTS-v2 and VALL-E with over an 8x reduction in model parameters and a 270x reduction in training data
comment: Submitted to IEEE Signal Processing Letters
☆ Fairness-Aware Estimation of Graphical Models
This paper examines the issue of fairness in the estimation of graphical models (GMs), particularly Gaussian, Covariance, and Ising models. These models play a vital role in understanding complex relationships in high-dimensional data. However, standard GMs can result in biased outcomes, especially when the underlying data involves sensitive characteristics or protected groups. To address this, we introduce a comprehensive framework designed to reduce bias in the estimation of GMs related to protected attributes. Our approach involves the integration of the pairwise graph disparity error and a tailored loss function into a nonsmooth multi-objective optimization problem, striving to achieve fairness across different sensitive groups while maintaining the effectiveness of the GMs. Experimental evaluations on synthetic and real-world datasets demonstrate that our framework effectively mitigates bias without undermining GMs' performance.
comment: 32 Pages, 9 Figures
☆ Continual learning with the neural tangent ensemble
A natural strategy for continual learning is to weigh a Bayesian ensemble of fixed functions. This suggests that if a (single) neural network could be interpreted as an ensemble, one could design effective algorithms that learn without forgetting. To realize this possibility, we observe that a neural network classifier with N parameters can be interpreted as a weighted ensemble of N classifiers, and that in the lazy regime limit these classifiers are fixed throughout learning. We term these classifiers the neural tangent experts and show they output valid probability distributions over the labels. We then derive the likelihood and posterior probability of each expert given past data. Surprisingly, we learn that the posterior updates for these experts are equivalent to a scaled and projected form of stochastic gradient descent (SGD) over the network weights. Away from the lazy regime, networks can be seen as ensembles of adaptive experts which improve over time. These results offer a new interpretation of neural networks as Bayesian ensembles of experts, providing a principled framework for understanding and mitigating catastrophic forgetting in continual learning settings.
☆ Bayesian Optimization for Non-Convex Two-Stage Stochastic Optimization Problems
Bayesian optimization is a sample-efficient method for solving expensive, black-box optimization problems. Stochastic programming concerns optimization under uncertainty where, typically, average performance is the quantity of interest. In the first stage of a two-stage problem, here-and-now decisions must be made in the face of this uncertainty, while in the second stage, wait-and-see decisions are made after the uncertainty has been resolved. Many methods in stochastic programming assume that the objective is cheap to evaluate and linear or convex. In this work, we apply Bayesian optimization to solve non-convex, two-stage stochastic programs which are expensive to evaluate. We formulate a knowledge-gradient-based acquisition function to jointly optimize the first- and second-stage variables, establish a guarantee of asymptotic consistency and provide a computationally efficient approximation. We demonstrate comparable empirical results to an alternative we formulate which alternates its focus between the two variable types, and superior empirical results over the standard, naive, two-step benchmark. We show that differences in the dimension and length scales between the variable types can lead to inefficiencies of the two-step algorithm, while the joint and alternating acquisition functions perform well in all problems tested. Experiments are conducted on both synthetic and real-world examples.
☆ LASSO-MOGAT: A Multi-Omics Graph Attention Framework for Cancer Classification
The application of machine learning methods to analyze changes in gene expression patterns has recently emerged as a powerful approach in cancer research, enhancing our understanding of the molecular mechanisms underpinning cancer development and progression. Combining gene expression data with other types of omics data has been reported by numerous works to improve cancer classification outcomes. Despite these advances, effectively integrating high-dimensional multi-omics data and capturing the complex relationships across different biological layers remains challenging. This paper introduces LASSO-MOGAT (LASSO-Multi-Omics Gated ATtention), a novel graph-based deep learning framework that integrates messenger RNA, microRNA, and DNA methylation data to classify 31 cancer types. Utilizing differential expression analysis with LIMMA and LASSO regression for feature selection, and leveraging Graph Attention Networks (GATs) to incorporate protein-protein interaction (PPI) networks, LASSO-MOGAT effectively captures intricate relationships within multi-omics data. Experimental validation using five-fold cross-validation demonstrates the method's precision, reliability, and capacity for providing comprehensive insights into cancer molecular mechanisms. The computation of attention coefficients for the edges in the graph by the proposed graph-attention architecture based on protein-protein interactions proved beneficial for identifying synergies in multi-omics data for cancer classification.
☆ MoRe Fine-Tuning with 10x Fewer Parameters
Parameter-efficient fine-tuning (PEFT) techniques have unlocked the potential to cheaply and easily specialize large pretrained models. However, the most prominent approaches, like low-rank adapters (LoRA), depend on heuristics or rules-of-thumb for their architectural choices -- potentially limiting their performance for new models and architectures. This limitation suggests that techniques from neural architecture search could be used to obtain optimal adapter architectures, but these are often expensive and difficult to implement. We address this challenge with Monarch Rectangular Fine-tuning (MoRe), a simple framework to search over adapter architectures that relies on the Monarch matrix class. Theoretically, we show that MoRe is more expressive than LoRA. Empirically, our approach is more parameter-efficient and performant than state-of-the-art PEFTs on a range of tasks and models, with as few as 5\% of LoRA's parameters.
☆ Traffic expertise meets residual RL: Knowledge-informed model-based residual reinforcement learning for CAV trajectory control
Model-based reinforcement learning (RL) is anticipated to exhibit higher sample efficiency compared to model-free RL by utilizing a virtual environment model. However, it is challenging to obtain sufficiently accurate representations of the environmental dynamics due to uncertainties in complex systems and environments. An inaccurate environment model may degrade the sample efficiency and performance of model-based RL. Furthermore, while model-based RL can improve sample efficiency, it often still requires substantial training time to learn from scratch, potentially limiting its advantages over model-free approaches. To address these challenges, this paper introduces a knowledge-informed model-based residual reinforcement learning framework aimed at enhancing learning efficiency by infusing established expert knowledge into the learning process and avoiding the issue of beginning from zero. Our approach integrates traffic expert knowledge into a virtual environment model, employing the Intelligent Driver Model (IDM) for basic dynamics and neural networks for residual dynamics, thus ensuring adaptability to complex scenarios. We propose a novel strategy that combines traditional control methods with residual RL, facilitating efficient learning and policy optimization without the need to learn from scratch. The proposed approach is applied to CAV trajectory control tasks for the dissipation of stop-and-go waves in mixed traffic flow. Experimental results demonstrate that our proposed approach enables the CAV agent to achieve superior performance in trajectory control compared to the baseline agents in terms of sample efficiency, traffic flow smoothness and traffic mobility. The source code and supplementary materials are available at https://github.com/zihaosheng/traffic-expertise-RL/.
☆ Exploring the Impact of Environmental Pollutants on Multiple Sclerosis Progression
Multiple Sclerosis (MS) is a chronic autoimmune and inflammatory neurological disorder characterised by episodes of symptom exacerbation, known as relapses. In this study, we investigate the role of environmental factors in relapse occurrence among MS patients, using data from the H2020 BRAINTEASER project. We employed predictive models, including Random Forest (RF) and Logistic Regression (LR), with varying sets of input features to predict the occurrence of relapses based on clinical and pollutant data collected over a week. The RF yielded the best result, with an AUC-ROC score of 0.713. Environmental variables, such as precipitation, NO2, PM2.5, humidity, and temperature, were found to be relevant to the prediction.
☆ Leveraging Graph Neural Networks to Forecast Electricity Consumption ECML
Accurate electricity demand forecasting is essential for several reasons, especially as the integration of renewable energy sources and the transition to a decentralized network paradigm introduce greater complexity and uncertainty. The proposed methodology leverages graph-based representations to effectively capture the spatial distribution and relational intricacies inherent in this decentralized network structure. This research work offers a novel approach that extends beyond the conventional Generalized Additive Model framework by considering models like Graph Convolutional Networks or Graph SAGE. These graph-based models enable the incorporation of various levels of interconnectedness and information sharing among nodes, where each node corresponds to the combined load (i.e. consumption) of a subset of consumers (e.g. the regions of a country). More specifically, we introduce a range of methods for inferring graphs tailored to consumption forecasting, along with a framework for evaluating the developed models in terms of both performance and explainability. We conduct experiments on electricity forecasting, in both a synthetic and a real framework considering the French mainland regions, and the performance and merits of our approach are discussed.
comment: 17 pages, ECML PKDD 2024 Workshop paper
☆ Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement INTERSPEECH 2024
Convolutional layers with 1-D filters are often used as frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via a auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
comment: Accepted at INTERSPEECH 2024
☆ C-RADAR: A Centralized Deep Learning System for Intrusion Detection in Software Defined Networks
The popularity of Software Defined Networks (SDNs) has grown in recent years, mainly because of their ability to simplify network management and improve network flexibility. However, this also makes them vulnerable to various types of cyber attacks. SDNs work on a centralized control plane which makes them more prone to network attacks. Research has demonstrated that deep learning (DL) methods can be successful in identifying intrusions in conventional networks, but their application in SDNs is still an open research area. In this research, we propose the use of DL techniques for intrusion detection in SDNs. We measure the effectiveness of our method by experimentation on a dataset of network traffic and comparing it to existing techniques. Our results show that the DL-based approach outperforms traditional methods in terms of detection accuracy and computational efficiency. The deep learning architecture that has been used in this research is a Long Short Term Memory Network and Self-Attention based architecture i.e. LSTM-Attn which achieves an Fl-score of 0.9721. Furthermore, this technique can be trained to detect new attack patterns and improve the overall security of SDNs.
☆ Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. However, its effects on learned policies remain puzzling: some studies highlight its importance for achieving strong performance, while others observe detrimental effects. In this paper, we first dissect the role of action chunking by analyzing the divergence between the learner and the demonstrator. We find that longer action chunks enable a policy to better capture temporal dependencies by taking into account more past states and actions within the chunk. However, this advantage comes at the cost of exacerbating errors in stochastic environments due to fewer observations of recent states. To address this, we propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop operations. BID samples multiple predictions at each time step and searches for the optimal one based on two criteria: (i) backward coherence, which favors samples aligned with previous decisions, (ii) forward contrast, which favors samples close to outputs of a stronger policy and distant from those of a weaker policy. By coupling decisions within and across action chunks, BID enhances temporal consistency over extended sequences while enabling adaptive replanning in stochastic environments. Experimental results show that BID substantially outperforms conventional closed-loop operations of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
comment: Project website: https://bid-robot.github.io/
☆ Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage
Fine-tuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pre-trained models from unverified sources, highlighting the potential risks involved.
☆ Evaluating Reliability in Medical DNNs: A Critical Analysis of Feature and Confidence-Based OOD Detection MICCAI 2023
Reliable use of deep neural networks (DNNs) for medical image analysis requires methods to identify inputs that differ significantly from the training data, called out-of-distribution (OOD), to prevent erroneous predictions. OOD detection methods can be categorised as either confidence-based (using the model's output layer for OOD detection) or feature-based (not using the output layer). We created two new OOD benchmarks by dividing the D7P (dermatology) and BreastMNIST (ultrasound) datasets into subsets which either contain or don't contain an artefact (rulers or annotations respectively). Models were trained with artefact-free images, and images with the artefacts were used as OOD test sets. For each OOD image, we created a counterfactual by manually removing the artefact via image processing, to assess the artefact's impact on the model's predictions. We show that OOD artefacts can boost a model's softmax confidence in its predictions, due to correlations in training data among other factors. This contradicts the common assumption that OOD artefacts should lead to more uncertain outputs, an assumption on which most confidence-based methods rely. We use this to explain why feature-based methods (e.g. Mahalanobis score) typically have greater OOD detection performance than confidence-based methods (e.g. MCP). However, we also show that feature-based methods typically perform worse at distinguishing between inputs that lead to correct and incorrect predictions (for both OOD and ID data). Following from these insights, we argue that a combination of feature-based and confidence-based methods should be used within DNN pipelines to mitigate their respective weaknesses. These project's code and OOD benchmarks are available at: https://github.com/HarryAnthony/Evaluating_OOD_detection.
comment: Accepted for the Uncertainty for Safe Utilization of Machine Learning in Medical Imaging (UNSURE 2024) workshop at the MICCAI 2023
☆ Estimation of Cardiac and Non-cardiac Diagnosis from Electrocardiogram Features
Introduction: Ensuring timely and accurate diagnosis of medical conditions is paramount for effective patient care. Electrocardiogram (ECG) signals are fundamental for evaluating a patient's cardiac health and are readily available. Despite this, little attention has been given to the remarkable potential of ECG data in detecting non-cardiac conditions. Methods: In our study, we used publicly available datasets (MIMIC-IV-ECG-ICD and ECG-VIEW II) to investigate the feasibility of inferring general diagnostic conditions from ECG features. To this end, we trained a tree-based model (XGBoost) based on ECG features and basic demographic features to estimate a wide range of diagnoses, encompassing both cardiac and non-cardiac conditions. Results: Our results demonstrate the reliability of estimating 23 cardiac as well as 21 non-cardiac conditions above 0.7 AUROC in a statistically significant manner across a wide range of physiological categories. Our findings underscore the predictive potential of ECG data in identifying well-known cardiac conditions. However, even more striking, this research represents a pioneering effort in systematically expanding the scope of ECG-based diagnosis to conditions not traditionally associated with the cardiac system.
comment: 4 pages, source code under https://github.com/AI4HealthUOL/CardioDiag
☆ Modularity in Transformers: Investigating Neuron Separability & Specialization
Transformer models are increasingly prevalent in various applications, yet our understanding of their internal workings remains limited. This paper investigates the modularity and task specialization of neurons within transformer architectures, focusing on both vision (ViT) and language (Mistral 7B) models. Using a combination of selective pruning and MoEfication clustering techniques, we analyze the overlap and specialization of neurons across different tasks and data subsets. Our findings reveal evidence of task-specific neuron clusters, with varying degrees of overlap between related tasks. We observe that neuron importance patterns persist to some extent even in randomly initialized models, suggesting an inherent structure that training refines. Additionally, we find that neuron clusters identified through MoEfication correspond more strongly to task-specific neurons in earlier and later layers of the models. This work contributes to a more nuanced understanding of transformer internals and offers insights into potential avenues for improving model interpretability and efficiency.
comment: 11 pages, 6 figures
☆ Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Though there are many interpretability methods, many look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate the effectiveness in language models and vision transformers through various methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term 'peak ablation'. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods, with resampling usually causing the most significant performance deterioration. We make our code available at https://github.com/nickypro/investigating-ablation.
comment: 9 pages, 2 figures, XAI World Conference 2024 Late-Breaking Work
☆ Fair Best Arm Identification with Fixed Confidence
In this work, we present a novel framework for Best Arm Identification (BAI) under fairness constraints, a setting that we refer to as \textit{F-BAI} (fair BAI). Unlike traditional BAI, which solely focuses on identifying the optimal arm with minimal sample complexity, F-BAI also includes a set of fairness constraints. These constraints impose a lower limit on the selection rate of each arm and can be either model-agnostic or model-dependent. For this setting, we establish an instance-specific sample complexity lower bound and analyze the \textit{price of fairness}, quantifying how fairness impacts sample complexity. Based on the sample complexity lower bound, we propose F-TaS, an algorithm provably matching the sample complexity lower bound, while ensuring that the fairness constraints are satisfied. Numerical results, conducted using both a synthetic model and a practical wireless scheduling application, show the efficiency of F-TaS in minimizing the sample complexity while achieving low fairness violations.
☆ Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations
Advancing Machine Learning (ML)-based perception models for autonomous systems necessitates addressing weak spots within the models, particularly in challenging Operational Design Domains (ODDs). These are environmental operating conditions of an autonomous vehicle which can contain difficult conditions, e.g., lens flare at night or objects reflected in a wet street. This report introduces a novel methodology for training with augmentations to enhance model robustness and performance in such conditions. The proposed approach leverages customized physics-based augmentation functions, to generate realistic training data that simulates diverse ODD scenarios. We present a comprehensive framework that includes identifying weak spots in ML models, selecting suitable augmentations, and devising effective training strategies. The methodology integrates hyperparameter optimization and latent space optimization to fine-tune augmentation parameters, ensuring they maximally improve the ML models' performance. Experimental results demonstrate improvements in model performance, as measured by commonly used metrics such as mean Average Precision (mAP) and mean Intersection over Union (mIoU) on open-source object detection and semantic segmentation models and datasets. Our findings emphasize that optimal training strategies are model- and data-specific and highlight the benefits of integrating augmentations into the training pipeline. By incorporating augmentations, we observe enhanced robustness of ML-based perception models, making them more resilient to edge cases encountered in real-world ODDs. This work underlines the importance of customized augmentations and offers an effective solution for improving the safety and reliability of autonomous driving functions.
☆ Hybridizing Base-Line 2D-CNN Model with Cat Swarm Optimization for Enhanced Advanced Persistent Threat Detection
In the realm of cyber-security, detecting Advanced Persistent Threats (APTs) remains a formidable challenge due to their stealthy and sophisticated nature. This research paper presents an innovative approach that leverages Convolutional Neural Networks (CNNs) with a 2D baseline model, enhanced by the cutting-edge Cat Swarm Optimization (CSO) algorithm, to significantly improve APT detection accuracy. By seamlessly integrating the 2D-CNN baseline model with CSO, we unlock the potential for unprecedented accuracy and efficiency in APT detection. The results unveil an impressive accuracy score of $98.4\%$, marking a significant enhancement in APT detection across various attack stages, illuminating a path forward in combating these relentless and sophisticated threats.
comment: 6 pages, 5 figures
☆ Accelerating the discovery of steady-states of planetary interior dynamics with machine learning
Simulating mantle convection often requires reaching a computationally expensive steady-state, crucial for deriving scaling laws for thermal and dynamical flow properties and benchmarking numerical solutions. The strong temperature dependence of the rheology of mantle rocks causes viscosity variations of several orders of magnitude, leading to a slow-evolving stagnant lid where heat conduction dominates, overlying a rapidly-evolving and strongly convecting region. Time-stepping methods, while effective for fluids with constant viscosity, are hindered by the Courant criterion, which restricts the time step based on the system's maximum velocity and grid size. Consequently, achieving steady-state requires a large number of time steps due to the disparate time scales governing the stagnant and convecting regions. We present a concept for accelerating mantle convection simulations using machine learning. We generate a dataset of 128 two-dimensional simulations with mixed basal and internal heating, and pressure- and temperature-dependent viscosity. We train a feedforward neural network on 97 simulations to predict steady-state temperature profiles. These can then be used to initialize numerical time stepping methods for different simulation parameters. Compared to typical initializations, the number of time steps required to reach steady-state is reduced by a median factor of 3.75. The benefit of this method lies in requiring very few simulations to train on, providing a solution with no prediction error as we initialize a numerical method, and posing minimal computational overhead at inference time. We demonstrate the effectiveness of our approach and discuss the potential implications for accelerated simulations for advancing mantle convection research.
☆ Stationary Policies are Optimal in Risk-averse Total-reward MDPs with EVaR
Optimizing risk-averse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse {\em total reward criterion}, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary policy, making it simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming to compute optimal policies. In comparison with prior work, our results only require the relatively mild condition of transient MDPs and allow for {\em both} positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
☆ Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
Text-to-image models, such as Stable Diffusion (SD), undergo iterative updates to improve image quality and address concerns such as safety. Improvements in image quality are straightforward to assess. However, how model updates resolve existing concerns and whether they raise new questions remain unexplored. This study takes an initial step in investigating the evolution of text-to-image models from the perspectives of safety, bias, and authenticity. Our findings, centered on Stable Diffusion, indicate that model updates paint a mixed picture. While updates progressively reduce the generation of unsafe images, the bias issue, particularly in gender, intensifies. We also find that negative stereotypes either persist within the same Non-White race group or shift towards other Non-White race groups through SD updates, yet with minimal association of these traits with the White race group. Additionally, our evaluation reveals a new concern stemming from SD updates: State-of-the-art fake image detectors, initially trained for earlier SD versions, struggle to identify fake images generated by updated versions. We show that fine-tuning these detectors on fake images generated by updated versions achieves at least 96.6\% accuracy across various SD versions, addressing this issue. Our insights highlight the importance of continued efforts to mitigate biases and vulnerabilities in evolving text-to-image models.
comment: To Appear in the ACM Conference on Computer and Communications Security, October 14-18, 2024
☆ Minimax and Communication-Efficient Distributed Best Subset Selection with Oracle Property
The explosion of large-scale data in fields such as finance, e-commerce, and social media has outstripped the processing capabilities of single-machine systems, driving the need for distributed statistical inference methods. Traditional approaches to distributed inference often struggle with achieving true sparsity in high-dimensional datasets and involve high computational costs. We propose a novel, two-stage, distributed best subset selection algorithm to address these issues. Our approach starts by efficiently estimating the active set while adhering to the $\ell_0$ norm-constrained surrogate likelihood function, effectively reducing dimensionality and isolating key variables. A refined estimation within the active set follows, ensuring sparse estimates and matching the minimax $\ell_2$ error bound. We introduce a new splicing technique for adaptive parameter selection to tackle subproblems under $\ell_0$ constraints and a Generalized Information Criterion (GIC). Our theoretical and numerical studies show that the proposed algorithm correctly finds the true sparsity pattern, has the oracle property, and greatly lowers communication costs. This is a big step forward in distributed sparse estimation.
☆ The Transferability of Downsampling Sparse Graph Convolutional Networks
In this paper, we propose a large-scale sparse graph downsampling method based on a sparse random graph model, which allows for the adjustment of different sparsity levels. We combine sparsity and topological similarity: the sparse graph model reduces the node connection probability as the graph size increases, while the downsampling method preserves a specific topological connection pattern during this change. Based on the downsampling method, we derive a theoretical transferability bound about downsampling sparse graph convolutional networks (GCNs), that higher sampling rates, greater average degree expectations, and smaller initial graph sizes lead to better downsampling transferability performance.
☆ Equation identification for fluid flows via physics-informed neural networks ICML 2024
Scientific machine learning (SciML) methods such as physics-informed neural networks (PINNs) are used to estimate parameters of interest from governing equations and small quantities of data. However, there has been little work in assessing how well PINNs perform for inverse problems across wide ranges of governing equations across the mathematical sciences. We present a new and challenging benchmark problem for inverse PINNs based on a parametric sweep of the 2D Burgers' equation with rotational flow. We show that a novel strategy that alternates between first- and second-order optimization proves superior to typical first-order strategies for estimating parameters. In addition, we propose a novel data-driven method to characterize PINN effectiveness in the inverse setting. PINNs' physics-informed regularization enables them to leverage small quantities of data more efficiently than the data-driven baseline. However, both PINNs and the baseline can fail to recover parameters for highly inviscid flows, motivating the need for further development of PINN methods.
comment: Published at ICML 2024 AI4Science: https://openreview.net/forum?id=XsvCLEYH3O
☆ Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach
The proliferation of e-commerce and urbanization has significantly intensified delivery operations in urban areas, boosting the volume and complexity of delivery demand. Data-driven predictive methods, especially those utilizing machine learning techniques, have emerged to handle these complexities in urban delivery demand management problems. One particularly pressing problem that has not yet been sufficiently studied is the joint estimation and prediction of city-wide delivery demand. To this end, we formulate this problem as a graph-based spatiotemporal learning task. First, a message-passing neural network model is formalized to capture the interaction between demand patterns of associated regions. Second, by exploiting recent advances in large language models, we extract general geospatial knowledge encodings from the unstructured locational data and integrate them into the demand predictor. Last, to encourage the cross-city transferability of the model, an inductive training scheme is developed in an end-to-end routine. Extensive empirical results on two real-world delivery datasets, including eight cities in China and the US, demonstrate that our model significantly outperforms state-of-the-art baselines in these challenging tasks.
Self-supervised learning for crystal property prediction via denoising ICML 2024
Accurate prediction of the properties of crystalline materials is crucial for targeted discovery, and this prediction is increasingly done with data-driven models. However, for many properties of interest, the number of materials for which a specific property has been determined is much smaller than the number of known materials. To overcome this disparity, we propose a novel self-supervised learning (SSL) strategy for material property prediction. Our approach, crystal denoising self-supervised learning (CDSSL), pretrains predictive models (e.g., graph networks) with a pretext task based on recovering valid material structures when given perturbed versions of these structures. We demonstrate that CDSSL models out-perform models trained without SSL, across material types, properties, and dataset sizes.
comment: Published at ICML 2024 AI4Science: https://openreview.net/forum?id=yML9ufAEoV
☆ Learning and Verifying Maximal Taylor-Neural Lyapunov functions
We introduce a novel neural network architecture, termed Taylor-neural Lyapunov functions, designed to approximate Lyapunov functions with formal certification. This architecture innovatively encodes local approximations and extends them globally by leveraging neural networks to approximate the residuals. Our method recasts the problem of estimating the largest region of attraction - specifically for maximal Lyapunov functions - into a learning problem, ensuring convergence around the origin through robust control theory. Physics-informed machine learning techniques further refine the estimation of the largest region of attraction. Remarkably, this method is versatile, operating effectively even without simulated data points. We validate the efficacy of our approach by providing numerical certificates of convergence across multiple examples. Our proposed methodology not only competes closely with state-of-the-art approaches, such as sum-of-squares and LyZNet, but also achieves comparable results even in the absence of simulated data. This work represents a significant advancement in control theory, with broad potential applications in the design of stable control systems and beyond.
☆ Categorical data clustering: 25 years beyond K-modes
The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical datasets, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-modes. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.
☆ Using Quantum Solved Deep Boltzmann Machines to Increase the Data Efficiency of RL Agents
Deep Learning algorithms, such as those used in Reinforcement Learning, often require large quantities of data to train effectively. In most cases, the availability of data is not a significant issue. However, for some contexts, such as in autonomous cyber defence, we require data efficient methods. Recently, Quantum Machine Learning and Boltzmann Machines have been proposed as solutions to this challenge. In this work we build upon the pre-existing work to extend the use of Deep Boltzmann Machines to the cutting edge algorithm Proximal Policy Optimisation in a Reinforcement Learning cyber defence environment. We show that this approach, when solved using a D-WAVE quantum annealer, can lead to a two-fold increase in data efficiency. We therefore expect it to be used by the machine learning and quantum communities who are hoping to capitalise on data-efficient Reinforcement Learning methods.
☆ AI-Driven Intrusion Detection Systems (IDS) on the ROAD dataset: A Comparative Analysis for automotive Controller Area Network (CAN)
The integration of digital devices in modern vehicles has revolutionized automotive technology, enhancing safety and the overall driving experience. The Controller Area Network (CAN) bus is a central system for managing in-vehicle communication between the electronic control units (ECUs). However, the CAN protocol poses security challenges due to inherent vulnerabilities, lacking encryption and authentication, which, combined with an expanding attack surface, necessitates robust security measures. In response to this challenge, numerous Intrusion Detection Systems (IDS) have been developed and deployed. Nonetheless, an open, comprehensive, and realistic dataset to test the effectiveness of such IDSs remains absent in the existing literature. This paper addresses this gap by considering the latest ROAD dataset, containing stealthy and sophisticated injections. The methodology involves dataset labelling and the implementation of both state-of-the-art deep learning models and traditional machine learning models to show the discrepancy in performance between the datasets most commonly used in the literature and the ROAD dataset, a more realistic alternative.
☆ Geometry of Lightning Self-Attention: Identifiability and Dimension
We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.
☆ Democratizing AI in Africa: FL for Low-Resource Edge Devices
Africa faces significant challenges in healthcare delivery due to limited infrastructure and access to advanced medical technologies. This study explores the use of federated learning to overcome these barriers, focusing on perinatal health. We trained a fetal plane classifier using perinatal data from five African countries: Algeria, Ghana, Egypt, Malawi, and Uganda, along with data from Spanish hospitals. To incorporate the lack of computational resources in the analysis, we considered a heterogeneous set of devices, including a Raspberry Pi and several laptops, for model training. We demonstrate comparative performance between a centralized and a federated model, despite the compute limitations, and a significant improvement in model generalizability when compared to models trained only locally. These results show the potential for a future implementation at a large scale of a federated learning platform to bridge the accessibility gap and improve model generalizability with very little requirements.
☆ Towards Symbolic XAI -- Explanation Through Human Understandable Logical Relationships Between Features
Explainable Artificial Intelligence (XAI) plays a crucial role in fostering transparency and trust in AI systems, where traditional XAI approaches typically offer one level of abstraction for explanations, often in the form of heatmaps highlighting single or multiple input features. However, we ask whether abstract reasoning or problem-solving strategies of a model may also be relevant, as these align more closely with how humans approach solutions to problems. We propose a framework, called Symbolic XAI, that attributes relevance to symbolic queries expressing logical relationships between input features, thereby capturing the abstract reasoning behind a model's predictions. The methodology is built upon a simple yet general multi-order decomposition of model predictions. This decomposition can be specified using higher-order propagation-based relevance methods, such as GNN-LRP, or perturbation-based explanation methods commonly used in XAI. The effectiveness of our framework is demonstrated in the domains of natural language processing (NLP), vision, and quantum chemistry (QC), where abstract symbolic domain knowledge is abundant and of significant interest to users. The Symbolic XAI framework provides an understanding of the model's decision-making process that is both flexible for customization by the user and human-readable through logical formulas.
☆ Short-term Wind Speed Forecasting for Power Integration in Smart Grids based on Hybrid LSSVM-SVMD Method
Owing to its minimal pollution and efficient energy use, wind energy has become one of the most widely exploited renewable energy resources. The successful integration of wind power into the grid system is contingent upon accurate wind speed forecasting models. However, the task of wind speed forecasting is challenging due to the inherent intermittent characteristics of wind speed. In this paper, a hybrid machine learning approach is developed for predicting short-term wind speed. First, the wind data was decomposed into modal components using Successive Variational Mode Decomposition (SVMD). Then, each sub-signal was fitted into a Least Squares Support Vector Machines (LSSVM) model, with its hyperparameter optimized by a novel variant of Quantum-behaved Particle Swarm Optimization (QPSO), QPSO with elitist breeding (EBQPSO). Second, the residuals making up for the differences between the original wind series and the aggregate of the SVMD modes were modeled using long short-term model (LSTM). Then, the overall predicted values were computed using the aggregate of the LSSVM and the LSTM models. Finally, the performance of the proposed model was compared against state-of-the-art benchmark models for forecasting wind speed using two separate data sets collected from a local wind farm. Empirical results show significant improvement in performance by the proposed method, achieving a 1.21% to 32.76% reduction in root mean square error (RMSE) and a 2.05% to 40.75% reduction in mean average error (MAE) compared to the benchmark methods. The entire code implementation of this work is freely available in Github.
☆ Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis
How can balance be quantified in game settings? This question is crucial for game designers, especially in player-versus-player (PvP) games, where analyzing the strength relations among predefined team compositions-such as hero combinations in multiplayer online battle arena (MOBA) games or decks in card games-is essential for enhancing gameplay and achieving balance. We have developed two advanced measures that extend beyond the simplistic win rate to quantify balance in zero-sum competitive scenarios. These measures are derived from win value estimations, which employ strength rating approximations via the Bradley-Terry model and counter relationship approximations via vector quantization, significantly reducing the computational complexity associated with traditional win value estimations. Throughout the learning process of these models, we identify useful categories of compositions and pinpoint their counter relationships, aligning with the experiences of human players without requiring specific game knowledge. Our methodology hinges on a simple technique to enhance codebook utilization in discrete representation with a deterministic vector quantization process for an extremely small state space. Our framework has been validated in popular online games, including Age of Empires II, Hearthstone, Brawl Stars, and League of Legends. The accuracy of the observed strength relations in these games is comparable to traditional pairwise win value predictions, while also offering a more manageable complexity for analysis. Ultimately, our findings contribute to a deeper understanding of PvP game dynamics and present a methodology that significantly improves game balance evaluation and design.
comment: TMLR 09/2024 https://openreview.net/forum?id=2D36otXvBE
☆ SafeTail: Efficient Tail Latency Optimization in Edge Service Scheduling via Computational Redundancy Management
Optimizing tail latency while efficiently managing computational resources is crucial for delivering high-performance, latency-sensitive services in edge computing. Emerging applications, such as augmented reality, require low-latency computing services with high reliability on user devices, which often have limited computational capabilities. Consequently, these devices depend on nearby edge servers for processing. However, inherent uncertainties in network and computation latencies stemming from variability in wireless networks and fluctuating server loads make service delivery on time challenging. Existing approaches often focus on optimizing median latency but fall short of addressing the specific challenges of tail latency in edge environments, particularly under uncertain network and computational conditions. Although some methods do address tail latency, they typically rely on fixed or excessive redundancy and lack adaptability to dynamic network conditions, often being designed for cloud environments rather than the unique demands of edge computing. In this paper, we introduce SafeTail, a framework that meets both median and tail response time targets, with tail latency defined as latency beyond the 90^th percentile threshold. SafeTail addresses this challenge by selectively replicating services across multiple edge servers to meet target latencies. SafeTail employs a reward-based deep learning framework to learn optimal placement strategies, balancing the need to achieve target latencies with minimizing additional resource usage. Through trace-driven simulations, SafeTail demonstrated near-optimal performance and outperformed most baseline strategies across three diverse services.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Learning Multi-Target TDOA Features for Sound Event Localization and Detection
Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
comment: DCASE 2024
☆ Efficient Testable Learning of General Halfspaces with Adversarial Label Noise COLT'24
We study the task of testable learning of general -- not necessarily homogeneous -- halfspaces with adversarial label noise with respect to the Gaussian distribution. In the testable learning framework, the goal is to develop a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data.Our main result is the first polynomial time tester-learner for general halfspaces that achieves dimension-independent misclassification error. At the heart of our approach is a new methodology to reduce testable learning of general halfspaces to testable learning of nearly homogeneous halfspaces that may be of broader interest.
comment: Presented to COLT'24
☆ The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information
The rising footprint of machine learning has led to a focus on imposing \emph{model sparsity} as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-vs-sparsity is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework~\citep{lecun90brain, hassibi1992second, hassibi1993optimal}, which leverages loss curvature information to make better pruning decisions. Yet, these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that comes with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that we can leverage curvature information in OBS-like fashion upon the projection step of classic iterative sparse recovery algorithms such as IHT. We show for the first time that this leads both to improved convergence bounds under standard assumptions. Furthermore, we present extensions of this approach to the practical task of obtaining accurate sparse DNNs, and validate it experimentally at scale for Transformer-based models on vision and language tasks.
☆ Deep Feature Embedding for Tabular Data ICONIP 2024
Tabular data learning has extensive applications in deep learning but its existing embedding techniques are limited in numerical and categorical features such as the inability to capture complex relationships and engineering. This paper proposes a novel deep embedding framework with leverages lightweight deep neural networks to generate effective feature embeddings for tabular data in machine learning research. For numerical features, a two-step feature expansion and deep transformation technique is used to capture copious semantic information. For categorical features, a unique identification vector for each entity is referred by a compact lookup table with a parameterized deep embedding function to uniform the embedding size dimensions, and transformed into a embedding vector using deep neural network. Experiments are conducted on real-world datasets for performance evaluation.
comment: 15 pages, 2figures, accepted to ICONIP 2024, Paper ID: 1399
☆ Investigating Privacy Leakage in Dimensionality Reduction Methods via Reconstruction Attack
This study investigates privacy leakage in dimensionality reduction methods through a novel machine learning-based reconstruction attack. Employing an \emph{informed adversary} threat model, we develop a neural network capable of reconstructing high-dimensional data from low-dimensional embeddings. We evaluate six popular dimensionality reduction techniques: PCA, sparse random projection (SRP), multidimensional scaling (MDS), Isomap, $t$-SNE, and UMAP. Using both MNIST and NIH Chest X-ray datasets, we perform a qualitative analysis to identify key factors affecting reconstruction quality. Furthermore, we assess the effectiveness of an additive noise mechanism in mitigating these reconstruction attacks.
☆ The Many Faces of Optimal Weak-to-Strong Learning
Boosting is an extremely successful idea, allowing one to combine multiple low accuracy classifiers into a much more accurate voting classifier. In this work, we present a new and surprisingly simple Boosting algorithm that obtains a provably optimal sample complexity. Sample optimal Boosting algorithms have only recently been developed, and our new algorithm has the fastest runtime among all such algorithms and is the simplest to describe: Partition your training data into 5 disjoint pieces of equal size, run AdaBoost on each, and combine the resulting classifiers via a majority vote. In addition to this theoretical contribution, we also perform the first empirical comparison of the proposed sample optimal Boosting algorithms. Our pilot empirical study suggests that our new algorithm might outperform previous algorithms on large data sets.
☆ Towards Hyper-parameter-free Federated Learning
The adaptive synchronization techniques in federated learning (FL) for scaled global model updates show superior performance over the vanilla federated averaging (FedAvg) scheme. However, existing methods employ additional tunable hyperparameters on the server to determine the scaling factor. A contrasting approach is automated scaling analogous to tuning-free step-size schemes in stochastic gradient descent (SGD) methods, which offer competitive convergence rates and exhibit good empirical performance. In this work, we introduce two algorithms for automated scaling of global model updates. In our first algorithm, we establish that a descent-ensuring step-size regime at the clients ensures descent for the server objective. We show that such a scheme enables linear convergence for strongly convex federated objectives. Our second algorithm shows that the average of objective values of sampled clients is a practical and effective substitute for the objective function value at the server required for computing the scaling factor, whose computation is otherwise not permitted. Our extensive empirical results show that the proposed methods perform at par or better than the popular federated learning algorithms for both convex and non-convex problems. Our work takes a step towards designing hyper-parameter-free federated learning.
comment: 28 pages, 3 figures
☆ Flow Matching for Optimal Reaction Coordinates of Biomolecular System
We present Flow Matching for Reaction Coordinates (FMRC), a novel deep learning algorithm designed to identify optimal reaction coordinates (RC) in biomolecular reversible dynamics. FMRC is based on the mathematical principles of lumpability and decomposability, which we reformulate into a conditional probability framework for efficient data-driven optimization using deep generative models. While FMRC does not explicitly learn the well-established transfer operator or its eigenfunctions, it can effectively encode the dynamics of leading eigenfunctions of the system transfer operator into its low-dimensional RC space. We further quantitatively compare its performance with several state-of-the-art algorithms by evaluating the quality of Markov State Models (MSM) constructed in their respective RC spaces, demonstrating the superiority of FMRC in three increasingly complex biomolecular systems. Finally, we discuss its potential applications in downstream applications such as enhanced sampling methods and MSM construction.
☆ Controllable Edge-Type-Specific Interpretation in Multi-Relational Graph Neural Networks for Drug Response Prediction
Graph Neural Networks have been widely applied in critical decision-making areas that demand interpretable predictions, leading to the flourishing development of interpretability algorithms. However, current graph interpretability algorithms tend to emphasize generality and often overlook biological significance, thereby limiting their applicability in predicting cancer drug responses. In this paper, we propose a novel post-hoc interpretability algorithm for cancer drug response prediction, CETExplainer, which incorporates a controllable edge-type-specific weighting mechanism. It considers the mutual information between subgraphs and predictions, proposing a structural scoring approach to provide fine-grained, biologically meaningful explanations for predictive models. We also introduce a method for constructing ground truth based on real-world datasets to quantitatively evaluate the proposed interpretability algorithm. Empirical analysis on the real-world dataset demonstrates that CETExplainer achieves superior stability and improves explanation quality compared to leading algorithms, thereby offering a robust and insightful tool for cancer drug prediction.
☆ Efficient Estimation of Unique Components in Independent Component Analysis by Matrix Representation
Independent component analysis (ICA) is a widely used method in various applications of signal processing and feature extraction. It extends principal component analysis (PCA) and can extract important and complicated components with small variances. One of the major problems of ICA is that the uniqueness of the solution is not guaranteed, unlike PCA. That is because there are many local optima in optimizing the objective function of ICA. It has been shown previously that the unique global optimum of ICA can be estimated from many random initializations by handcrafted thread computation. In this paper, the unique estimation of ICA is highly accelerated by reformulating the algorithm in matrix representation and reducing redundant calculations. Experimental results on artificial datasets and EEG data verified the efficiency of the proposed method.
☆ Sparse Uncertainty-Informed Sampling from Federated Streaming Data
We present a numerically robust, computationally efficient approach for non-I.I.D. data stream sampling in federated client systems, where resources are limited and labeled data for local model adaptation is sparse and expensive. The proposed method identifies relevant stream observations to optimize the underlying client model, given a local labeling budget, and performs instantaneous labeling decisions without relying on any memory buffering strategies. Our experiments show enhanced training batch diversity and an improved numerical robustness of the proposal compared to existing strategies over large-scale data streams, making our approach an effective and convenient solution in FL environments.
comment: Preprint, 6 pages, 3 figures, Accepted for ESANN 2024
☆ RISSOLE: Parameter-efficient Diffusion Models via Block-wise Generation and Retrieval-Guidance
Diffusion-based models demonstrate impressive generation capabilities. However, they also have a massive number of parameters, resulting in enormous model sizes, thus making them unsuitable for deployment on resource-constraint devices. Block-wise generation can be a promising alternative for designing compact-sized (parameter-efficient) deep generative models since the model can generate one block at a time instead of generating the whole image at once. However, block-wise generation is also considerably challenging because ensuring coherence across generated blocks can be non-trivial. To this end, we design a retrieval-augmented generation (RAG) approach and leverage the corresponding blocks of the images retrieved by the RAG module to condition the training and generation stages of a block-wise denoising diffusion model. Our conditioning schemes ensure coherence across the different blocks during training and, consequently, during generation. While we showcase our approach using the latent diffusion model (LDM) as the base model, it can be used with other variants of denoising diffusion models. We validate the solution of the coherence problem through the proposed approach by reporting substantive experiments to demonstrate our approach's effectiveness in compact model size and excellent generation quality.
☆ FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition
Federated learning is a machine learning paradigm that enables decentralized clients to collaboratively learn a shared model while keeping all the training data local. While considerable research has focused on federated image generation, particularly Generative Adversarial Networks, Variational Autoencoders have received less attention. In this paper, we address the challenges of non-IID (independently and identically distributed) data environments featuring multiple groups of images of different types. Specifically, heterogeneous data distributions can lead to difficulties in maintaining a consistent latent space and can also result in local generators with disparate texture features being blended during aggregation. We introduce a novel approach, FissionVAE, which decomposes the latent space and constructs decoder branches tailored to individual client groups. This method allows for customized learning that aligns with the unique data distributions of each group. Additionally, we investigate the incorporation of hierarchical VAE architectures and demonstrate the use of heterogeneous decoder architectures within our model. We also explore strategies for setting the latent prior distributions to enhance the decomposition process. To evaluate our approach, we assemble two composite datasets: the first combines MNIST and FashionMNIST; the second comprises RGB datasets of cartoon and human faces, wild animals, marine vessels, and remote sensing images of Earth. Our experiments demonstrate that FissionVAE greatly improves generation quality on these datasets compared to baseline federated VAE models.
☆ Instant Adversarial Purification with Adversarial Consistency Distillation
Neural networks, despite their remarkable performance in widespread applications, including image classification, are also known to be vulnerable to subtle adversarial noise. Although some diffusion-based purification methods have been proposed, for example, DiffPure, those methods are time-consuming. In this paper, we propose One Step Control Purification (OSCP), a diffusion-based purification model that can purify the adversarial image in one Neural Function Evaluation (NFE) in diffusion models. We use Latent Consistency Model (LCM) and ControlNet for our one-step purification. OSCP is computationally friendly and time efficient compared to other diffusion-based purification methods; we achieve defense success rate of 74.19\% on ImageNet, only requiring 0.1s for each purification. Moreover, there is a fundamental incongruence between consistency distillation and adversarial perturbation. To address this ontological dissonance, we propose Gaussian Adversarial Noise Distillation (GAND), a novel consistency distillation framework that facilitates a more nuanced reconciliation of the latent space dynamics, effectively bridging the natural and adversarial manifolds. Our experiments show that the GAND does not need a Full Fine Tune (FFT); PEFT, e.g., LoRA is sufficient.
☆ A Survey of the Self Supervised Learning Mechanisms for Vision Transformers
Deep supervised learning models require high volume of labeled data to attain sufficiently good results. Although, the practice of gathering and annotating such big data is costly and laborious. Recently, the application of self supervised learning (SSL) in vision tasks has gained significant attention. The intuition behind SSL is to exploit the synchronous relationships within the data as a form of self-supervision, which can be versatile. In the current big data era, most of the data is unlabeled, and the success of SSL thus relies in finding ways to improve this vast amount of unlabeled data available. Thus its better for deep learning algorithms to reduce reliance on human supervision and instead focus on self-supervision based on the inherent relationships within the data. With the advent of ViTs, which have achieved remarkable results in computer vision, it is crucial to explore and understand the various SSL mechanisms employed for training these models specifically in scenarios where there is less label data available. In this survey we thus develop a comprehensive taxonomy of systematically classifying the SSL techniques based upon their representations and pre-training tasks being applied. Additionally, we discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field. Furthermore, we present a comparative analysis of different SSL methods, evaluate their strengths and limitations, and identify potential avenues for future research.
comment: 34 Pages, 5 Figures, 7 Tables
☆ Estimating Conditional Average Treatment Effects via Sufficient Representation Learning
Estimating the conditional average treatment effects (CATE) is very important in causal inference and has a wide range of applications across many fields. In the estimation process of CATE, the unconfoundedness assumption is typically required to ensure the identifiability of the regression problems. When estimating CATE using high-dimensional data, there have been many variable selection methods and neural network approaches based on representation learning, while these methods do not provide a way to verify whether the subset of variables after dimensionality reduction or the learned representations still satisfy the unconfoundedness assumption during the estimation process, which can lead to ineffective estimates of the treatment effects. Additionally, these methods typically use data from only the treatment or control group when estimating the regression functions for each group. This paper proposes a novel neural network approach named \textbf{CrossNet} to learn a sufficient representation for the features, based on which we then estimate the CATE, where cross indicates that in estimating the regression functions, we used data from their own group as well as cross-utilized data from another group. Numerical simulations and empirical results demonstrate that our method outperforms the competitive approaches.
☆ Error-controlled non-additive interaction discovery in machine learning models
Machine learning (ML) models are powerful tools for detecting complex patterns within data, yet their "black box" nature limits their interpretability, hindering their use in critical domains like healthcare and finance. To address this challenge, interpretable ML methods have been developed to explain how features influence model predictions. However, these methods often focus on univariate feature importance, overlooking the complex interactions between features that ML models are capable of capturing. Recognizing this limitation, recent efforts have aimed to extend these methods to discover feature interactions, but existing approaches struggle with robustness and error control, especially under data perturbations. In this study, we introduce Diamond, a novel method for trustworthy feature interaction discovery. Diamond uniquely integrates the model-X knockoffs framework to control the false discovery rate (FDR), ensuring that the proportion of falsely discovered interactions remains low. We further address the challenges of using off-the-shelf interaction importance measures by proposing a calibration procedure that refines these measures to maintain the desired FDR. Diamond's applicability spans a wide range of ML models, including deep neural networks, tree-based models, and factorization-based models. Our empirical evaluations on both simulated and real datasets across various biomedical studies demonstrate Diamond's utility in enabling more reliable data-driven scientific discoveries. This method represents a significant step forward in the deployment of ML models for scientific innovation and hypothesis generation.
☆ Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities
Imaging techniques such as Chest X-rays, whole slide images, and optical coherence tomography serve as the initial screening and detection for a wide variety of medical pulmonary and ophthalmic conditions respectively. This paper investigates the intricacies of using pretrained deep convolutional neural networks with transfer learning across diverse medical imaging datasets with varying modalities for binary and multiclass classification. We conducted a comprehensive performance analysis with ten network architectures and model families each with pretraining and random initialization. Our finding showed that the use of pretrained models as fixed feature extractors yields poor performance irrespective of the datasets. Contrary, histopathology microscopy whole slide images have better performance. It is also found that deeper and more complex architectures did not necessarily result in the best performance. This observation implies that the improvements in ImageNet are not parallel to the medical imaging tasks. Within a medical domain, the performance of the network architectures varies within model families with shifts in datasets. This indicates that the performance of models within a specific modality may not be conclusive for another modality within the same domain. This study provides a deeper understanding of the applications of deep learning techniques in medical imaging and highlights the impact of pretrained networks across different medical imaging datasets under five different experimental settings.
comment: 15 pages, 3 figures, 4 tables
☆ Improving Time Series Classification with Representation Soft Label Smoothing
Previous research has indicated that deep neural network based models for time series classification (TSC) tasks are prone to overfitting. This issue can be mitigated by employing strategies that prevent the model from becoming overly confident in its predictions, such as label smoothing and confidence penalty. Building upon the concept of label smoothing, we propose a novel approach to generate more reliable soft labels, which we refer to as representation soft label smoothing. We apply label smoothing, confidence penalty, and our method representation soft label smoothing to several TSC models and compare their performance with baseline method which only uses hard labels for training. Our results demonstrate that the use of these enhancement techniques yields competitive results compared to the baseline method. Importantly, our method demonstrates strong performance across models with varying structures and complexities.
comment: 14 pages,6 figures
☆ Evaluation of Table Representations to Answer Questions from Tables in Documents : A Case Study using 3GPP Specifications
With the ubiquitous use of document corpora for question answering, one important aspect which is especially relevant for technical documents is the ability to extract information from tables which are interspersed with text. The major challenge in this is that unlike free-flow text or isolated set of tables, the representation of a table in terms of what is a relevant chunk is not obvious. We conduct a series of experiments examining various representations of tabular data interspersed with text to understand the relative benefits of different representations. We choose a corpus of $3^{rd}$ Generation Partnership Project (3GPP) documents since they are heavily interspersed with tables. We create expert curated dataset of question answers to evaluate our approach. We conclude that row level representations with corresponding table header information being included in every cell improves the performance of the retrieval, thus leveraging the structural information present in the tabular data.
comment: 10 pages, 4 figures, 2 tables
☆ A Tighter Convergence Proof of Reverse Experience Replay
In reinforcement learning, Reverse Experience Replay (RER) is a recently proposed algorithm that attains better sample complexity than the classic experience replay method. RER requires the learning algorithm to update the parameters through consecutive state-action-reward tuples in reverse order. However, the most recent theoretical analysis only holds for a minimal learning rate and short consecutive steps, which converge slower than those large learning rate algorithms without RER. In view of this theoretical and empirical gap, we provide a tighter analysis that mitigates the limitation on the learning rate and the length of consecutive steps. Furthermore, we show theoretically that RER converges with a larger learning rate and a longer sequence.
comment: This paper is accepted at RLC 2024
☆ A Scalable k-Medoids Clustering via Whale Optimization Algorithm
Unsupervised clustering has emerged as a critical tool for uncovering hidden patterns and insights from vast, unlabeled datasets. However, traditional methods like Partitioning Around Medoids (PAM) struggle with scalability due to their quadratic computational complexity. To address this limitation, we introduce WOA-kMedoids, a novel unsupervised clustering method that incorporates the Whale Optimization Algorithm (WOA), a nature-inspired metaheuristic inspired by the hunting strategies of humpback whales. By optimizing centroid selection, WOA-kMedoids reduces computational complexity of the k-medoids algorithm from quadratic to near-linear with respect to the number of observations. This improvement in efficiency enables WOA-kMedoids to be scalable to large datasets while maintaining high clustering accuracy. We evaluated the performance of WOA-kMedoids on 25 diverse time series datasets from the UCR archive. Our empirical results demonstrate that WOA-kMedoids maintains clustering accuracy similar to PAM. While WOA-kMedoids exhibited slightly higher runtime than PAM on small datasets (less than 300 observations), it outperformed PAM in computational efficiency on larger datasets. The scalability of WOA-kMedoids, combined with its consistently high accuracy, positions it as a promising and practical choice for unsupervised clustering in big data applications. WOA-kMedoids has implications for efficient knowledge discovery in massive, unlabeled datasets across various domains.
comment: 11 pages, 2 figures
☆ From Model Explanation to Data Misinterpretation: Uncovering the Pitfalls of Post Hoc Explainers in Business Research
Machine learning models have been increasingly used in business research. However, most state-of-the-art machine learning models, such as deep neural networks and XGBoost, are black boxes in nature. Therefore, post hoc explainers that provide explanations for machine learning models by, for example, estimating numerical importance of the input features, have been gaining wide usage. Despite the intended use of post hoc explainers being explaining machine learning models, we found a growing trend in business research where post hoc explanations are used to draw inferences about the data. In this work, we investigate the validity of such use. Specifically, we investigate with extensive experiments whether the explanations obtained by the two most popular post hoc explainers, SHAP and LIME, provide correct information about the true marginal effects of X on Y in the data, which we call data-alignment. We then identify what factors influence the alignment of explanations. Finally, we propose a set of mitigation strategies to improve the data-alignment of explanations and demonstrate their effectiveness with real-world data in an econometric context. In spite of this effort, we nevertheless conclude that it is often not appropriate to infer data insights from post hoc explanations. We articulate appropriate alternative uses, the most important of which is to facilitate the proposition and subsequent empirical investigation of hypotheses. The ultimate goal of this paper is to caution business researchers against translating post hoc explanations of machine learning models into potentially false insights and understanding of data.
☆ The Sample-Communication Complexity Trade-off in Federated Q-Learning
We consider the problem of federated Q-learning, where $M$ agents aim to collaboratively learn the optimal Q-function of an unknown infinite-horizon Markov decision process with finite state and action spaces. We investigate the trade-off between sample and communication complexities for the widely used class of intermittent communication algorithms. We first establish the converse result, where it is shown that a federated Q-learning algorithm that offers any speedup with respect to the number of agents in the per-agent sample complexity needs to incur a communication cost of at least an order of $\frac{1}{1-\gamma}$ up to logarithmic factors, where $\gamma$ is the discount factor. We also propose a new algorithm, called Fed-DVR-Q, which is the first federated Q-learning algorithm to simultaneously achieve order-optimal sample and communication complexities. Thus, together these results provide a complete characterization of the sample-communication complexity trade-off in federated Q-learning.
☆ Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, we achieve a 16x increase in sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence chunk pipeline design, we can now train 8B LLM with 2 million sequence length on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.
☆ Technical Report of HelixFold3 for Biomolecular Structure Prediction
The AlphaFold series has transformed protein structure prediction with remarkable accuracy, often matching experimental methods. AlphaFold2, AlphaFold-Multimer, and the latest AlphaFold3 represent significant strides in predicting single protein chains, protein complexes, and biomolecular structures. While AlphaFold2 and AlphaFold-Multimer are open-sourced, facilitating rapid and reliable predictions, AlphaFold3 remains partially accessible through a limited online server and has not been open-sourced, restricting further development. To address these challenges, the PaddleHelix team is developing HelixFold3, aiming to replicate AlphaFold3's capabilities. Using insights from previous models and extensive datasets, HelixFold3 achieves an accuracy comparable to AlphaFold3 in predicting the structures of conventional ligands, nucleic acids, and proteins. The initial release of HelixFold3 is available as open source on GitHub for academic research, promising to advance biomolecular research and accelerate discoveries. We also provide online service at PaddleHelix website at https://paddlehelix.baidu.com/app/all/helixfold3/forecast.
☆ Point Neuron Learning: A New Physics-Informed Neural Network Architecture
Machine learning and neural networks have advanced numerous research domains, but challenges such as large training data requirements and inconsistent model performance hinder their application in certain scientific problems. To overcome these challenges, researchers have investigated integrating physics principles into machine learning models, mainly through: (i) physics-guided loss functions, generally termed as physics-informed neural networks, and (ii) physics-guided architectural design. While both approaches have demonstrated success across multiple scientific disciplines, they have limitations including being trapped to a local minimum, poor interpretability, and restricted generalizability. This paper proposes a new physics-informed neural network (PINN) architecture that combines the strengths of both approaches by embedding the fundamental solution of the wave equation into the network architecture, enabling the learned model to strictly satisfy the wave equation. The proposed point neuron learning method can model an arbitrary sound field based on microphone observations without any dataset. Compared to other PINN methods, our approach directly processes complex numbers and offers better interpretability and generalizability. We evaluate the versatility of the proposed architecture by a sound field reconstruction problem in a reverberant environment. Results indicate that the point neuron method outperforms two competing methods and can efficiently handle noisy environments with sparse microphone observations.
comment: under the review process of EURASIP Journal on Audio, Speech, and Music Processing
☆ UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches
Large language models (LLMs) have shown remarkable capabilities in generating user summaries from a long list of raw user activity data. These summaries capture essential user information such as preferences and interests, and therefore are invaluable for LLM-based personalization applications, such as explainable recommender systems. However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and human evaluation which is often costly and time-consuming. To address these challenges, we introduce \UserSumBench, a benchmark framework designed to facilitate iterative development of LLM-based summarization approaches. This framework offers two key components: (1) A reference-free summary quality metric. We show that this metric is effective and aligned with human preferences across three diverse datasets (MovieLens, Yelp and Amazon Review). (2) A novel robust summarization method that leverages time-hierarchical summarizer and self-critique verifier to produce high-quality summaries while eliminating hallucination. This method serves as a strong baseline for further innovation in summarization techniques.
☆ Discovery of False Data Injection Schemes on Frequency Controllers with Reinforcement Learning
While inverter-based distributed energy resources (DERs) play a crucial role in integrating renewable energy into the power system, they concurrently diminish the grid's system inertia, elevating the risk of frequency instabilities. Furthermore, smart inverters, interfaced via communication networks, pose a potential vulnerability to cyber threats if not diligently managed. To proactively fortify the power grid against sophisticated cyber attacks, we propose to employ reinforcement learning (RL) to identify potential threats and system vulnerabilities. This study concentrates on analyzing adversarial strategies for false data injection, specifically targeting smart inverters involved in primary frequency control. Our findings demonstrate that an RL agent can adeptly discern optimal false data injection methods to manipulate inverter settings, potentially causing catastrophic consequences.
☆ An Empirical Study of Scaling Laws for Transfer
We present a limited empirical study of scaling laws for transfer learning in transformer models. More specifically, we examine a scaling law that incorporates a "transfer gap" term, indicating the effectiveness of pre-training on one distribution when optimizing for downstream performance on another distribution. When the transfer gap is low, pre-training is a cost-effective strategy for improving downstream performance. Conversely, when the gap is high, collecting high-quality fine-tuning data becomes relatively more cost effective. Fitting the scaling law to experiments from diverse datasets reveals significant variations in the transfer gap across distributions. In theory, the scaling law can inform optimal data allocation strategies and highlights how the scarcity of downstream data can bottleneck performance. Our findings contribute to a principled way to measure transfer learning efficiency and understand how data availability affects capabilities.
♻ ☆ Quantum Distance Approximation for Persistence Diagrams
Topological Data Analysis methods can be useful for classification and clustering tasks in many different fields as they can provide two dimensional persistence diagrams that summarize important information about the shape of potentially complex and high dimensional data sets. The space of persistence diagrams can be endowed with various metrics such as the Wasserstein distance which admit a statistical structure and allow to use these summaries for machine learning algorithms. However, computing the distance between two persistence diagrams involves finding an optimal way to match the points of the two diagrams and may not always be an easy task for classical computers. In this work we explore the potential of quantum computers to estimate the distance between persistence diagrams, in particular we propose variational quantum algorithms for the Wasserstein distance as well as the $d^{c}_{p}$ distance. Our implementation is a weighted version of the Quantum Approximate Optimization Algorithm that relies on control clauses to encode the constraints of the optimization problem.
comment: 39 pages, 12 figures, 2 tables, submitted to Journal of Physics: Complexity
♻ ☆ A Survey on Knowledge Editing of Neural Networks
Deep neural networks are becoming increasingly pervasive in academia and industry, matching and surpassing human performance on a wide variety of fields and related tasks. However, just as humans, even the largest artificial neural networks make mistakes, and once-correct predictions can become invalid as the world progresses in time. Augmenting datasets with samples that account for mistakes or up-to-date information has become a common workaround in practical applications. However, the well-known phenomenon of catastrophic forgetting poses a challenge in achieving precise changes in the implicitly memorized knowledge of neural network parameters, often requiring a full model re-training to achieve desired behaviors. That is expensive, unreliable, and incompatible with the current trend of large self-supervised pre-training, making it necessary to find more efficient and effective methods for adapting neural network models to changing data. To address this need, knowledge editing is emerging as a novel area of research that aims to enable reliable, data-efficient, and fast changes to a pre-trained target model, without affecting model behaviors on previously learned tasks. In this survey, we provide a brief review of this recent artificial intelligence field of research. We first introduce the problem of editing neural networks, formalize it in a common framework and differentiate it from more notorious branches of research such as continuous learning. Next, we provide a review of the most relevant knowledge editing approaches and datasets proposed so far, grouping works under four different families: regularization techniques, meta-learning, direct model editing, and architectural strategies. Finally, we outline some intersections with other fields of research and potential directions for future works.
♻ ☆ Evaluating Named Entity Recognition: A comparative analysis of mono- and multilingual transformer models on a novel Brazilian corporate earnings call transcripts dataset
Since 2018, when the Transformer architecture was introduced, Natural Language Processing has gained significant momentum with pre-trained Transformer-based models that can be fine-tuned for various tasks. Most models are pre-trained on large English corpora, making them less applicable to other languages, such as Brazilian Portuguese. In our research, we identified two models pre-trained in Brazilian Portuguese (BERTimbau and PTT5) and two multilingual models (mBERT and mT5). BERTimbau and mBERT use only the Encoder module, while PTT5 and mT5 use both the Encoder and Decoder. Our study aimed to evaluate their performance on a financial Named Entity Recognition (NER) task and determine the computational requirements for fine-tuning and inference. To this end, we developed the Brazilian Financial NER (BraFiNER) dataset, comprising sentences from Brazilian banks' earnings calls transcripts annotated using a weakly supervised approach. Additionally, we introduced a novel approach that reframes the token classification task as a text generation problem. After fine-tuning the models, we evaluated them using performance and error metrics. Our findings reveal that BERT-based models consistently outperform T5-based models. While the multilingual models exhibit comparable macro F1-scores, BERTimbau demonstrates superior performance over PTT5. In terms of error metrics, BERTimbau outperforms the other models. We also observed that PTT5 and mT5 generated sentences with changes in monetary and percentage values, highlighting the importance of accuracy and consistency in the financial domain. Our findings provide insights into the differing performance of BERT- and T5-based models for the NER task.
♻ ☆ Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective ICML 2024
Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e., strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining their root-based counterpart's performance on transformers. The second-order perspective also has practical benefits for developing non-diagonal methods that can incorporate arbitrary curvature approximations through the concept of preconditioner invariance. In contrast to root-based methods like Shampoo, root-free counterparts work well and fast with half-precision since they do not require numerically unstable matrix root decompositions and inversions. Overall, our findings provide new insights into the development of adaptive methods and raise important questions regarding the overlooked role of adaptivity in their success. (experiment code: https://github.com/yorkerlin/remove-the-square-root optimizer code: https://github.com/f-dangel/sirfshampoo)
comment: A long version of the ICML 2024 paper. Added root-free update schemes for n-dim tensor cases
♻ ☆ Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset
Hoaxes are a recognised form of disinformation created deliberately, with potential serious implications in the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they often are written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. In this paper, We report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (full article vs the article's definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible, and complement our analysis with a study on the differences in distributions in edit histories, and find that looking at this feature yields better classification results than context.
♻ ☆ Recursive Estimation of Conditional Kernel Mean Embeddings
Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone's theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces.
♻ ☆ Complexity of High-Dimensional Identity Testing with Coordinate Conditional Sampling
We study the identity testing problem for high-dimensional distributions. Given as input an explicit distribution $\mu$, an $\varepsilon>0$, and access to sampling oracle(s) for a hidden distribution $\pi$, the goal in identity testing is to distinguish whether the two distributions $\mu$ and $\pi$ are identical or are at least $\varepsilon$-far apart. When there is only access to full samples from the hidden distribution $\pi$, it is known that exponentially many samples (in the dimension) may be needed for identity testing, and hence previous works have studied identity testing with additional access to various "conditional" sampling oracles. We consider a significantly weaker conditional sampling oracle, which we call the $\mathsf{Coordinate\ Oracle}$, and provide a computational and statistical characterization of the identity testing problem in this new model. We prove that if an analytic property known as approximate tensorization of entropy holds for an $n$-dimensional visible distribution $\mu$, then there is an efficient identity testing algorithm for any hidden distribution $\pi$ using $\tilde{O}(n/\varepsilon)$ queries to the $\mathsf{Coordinate\ Oracle}$. Approximate tensorization of entropy is a pertinent condition as recent works have established it for a large class of high-dimensional distributions. We also prove a computational phase transition: for a well-studied class of $n$-dimensional distributions, specifically sparse antiferromagnetic Ising models over $\{+1,-1\}^n$, we show that in the regime where approximate tensorization of entropy fails, there is no efficient identity testing algorithm unless $\mathsf{RP}=\mathsf{NP}$. We complement our results with a matching $\Omega(n/\varepsilon)$ statistical lower bound for the sample complexity of identity testing in the $\mathsf{Coordinate\ Oracle}$ model.
♻ ☆ Learning Dynamic Bayesian Networks from Data: Foundations, First Principles and Numerical Comparisons
In this paper, we present a guide to the foundations of learning Dynamic Bayesian Networks (DBNs) from data in the form of multiple samples of trajectories for some length of time. We present the formalism for a generic as well as a set of common types of DBNs for particular variable distributions. We present the analytical form of the models, with a comprehensive discussion on the interdependence between structure and weights in a DBN model and their implications for learning. Next, we give a broad overview of learning methods and describe and categorize them based on the most important statistical features, and how they treat the interplay between learning structure and weights. We give the analytical form of the likelihood and Bayesian score functions, emphasizing the distinction from the static case. We discuss functions used in optimization to enforce structural requirements. We briefly discuss more complex extensions and representations. Finally we present a set of comparisons in different settings for various distinct but representative algorithms across the variants.
♻ ☆ LightFF: Lightweight Inference for Forward-Forward Algorithm
The human brain performs tasks with an outstanding energy efficiency, i.e., with approximately 20 Watts. The state-of-the-art Artificial/Deep Neural Networks (ANN/DNN), on the other hand, have recently been shown to consume massive amounts of energy. The training of these ANNs/DNNs is done almost exclusively based on the back-propagation algorithm, which is known to be biologically implausible. This has led to a new generation of forward-only techniques, including the Forward-Forward algorithm. In this paper, we propose a lightweight inference scheme specifically designed for DNNs trained using the Forward-Forward algorithm. We have evaluated our proposed lightweight inference scheme in the case of the MNIST and CIFAR datasets, as well as two real-world applications, namely, epileptic seizure detection and cardiac arrhythmia classification using wearable technologies, where complexity overheads/energy consumption is a major constraint, and demonstrate its relevance. Our code is available at https://github.com/AminAminifar/LightFF.
♻ ☆ Learning the irreversible progression trajectory of Alzheimer's disease
Alzheimer's disease (AD) is a progressive and irreversible brain disorder that unfolds over the course of 30 years. Therefore, it is critical to capture the disease progression in an early stage such that intervention can be applied before the onset of symptoms. Machine learning (ML) models have been shown effective in predicting the onset of AD. Yet for subjects with follow-up visits, existing techniques for AD classification only aim for accurate group assignment, where the monotonically increasing risk across follow-up visits is usually ignored. Resulted fluctuating risk scores across visits violate the irreversibility of AD, hampering the trustworthiness of models and also providing little value to understanding the disease progression. To address this issue, we propose a novel regularization approach to predict AD longitudinally. Our technique aims to maintain the expected monotonicity of increasing disease risk during progression while preserving expressiveness. Specifically, we introduce a monotonicity constraint that encourages the model to predict disease risk in a consistent and ordered manner across follow-up visits. We evaluate our method using the longitudinal structural MRI and amyloid-PET imaging data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our model outperforms existing techniques in capturing the progressiveness of disease risk, and at the same time preserves prediction accuracy.
comment: accepted by ISBI 2024
♻ ☆ Parameters Inference for Nonlinear Wave Equations with Markovian Switching
Traditional partial differential equations with constant coefficients often struggle to capture abrupt changes in real-world phenomena, leading to the development of variable coefficient PDEs and Markovian switching models. Recently, research has introduced the concept of PDEs with Markov switching models, established their well-posedness and presented numerical methods. However, there has been limited discussion on parameter estimation for the jump coefficients in these models. This paper addresses this gap by focusing on parameter inference for the wave equation with Markovian switching. We propose a Bayesian statistical framework using discrete sparse Bayesian learning to establish its convergence and a uniform error bound. Our method requires fewer assumptions and enables independent parameter inference for each segment by allowing different underlying structures for the parameter estimation problem within each segmented time interval. The effectiveness of our approach is demonstrated through three numerical cases, which involve noisy spatiotemporal data from different wave equations with Markovian switching. The results show strong performance in parameter estimation for variable coefficient PDEs.
♻ ☆ Effectiveness of probabilistic contact tracing in epidemic containment: the role of super-spreaders and transmission path reconstruction
The recent COVID-19 pandemic underscores the significance of early-stage non-pharmacological intervention strategies. The widespread use of masks and the systematic implementation of contact tracing strategies provide a potentially equally effective and socially less impactful alternative to more conventional approaches, such as large-scale mobility restrictions. However, manual contact tracing faces strong limitations in accessing the network of contacts, and the scalability of currently implemented protocols for smartphone-based digital contact tracing becomes impractical during the rapid expansion phases of the outbreaks, due to the surge in exposure notifications and associated tests. A substantial improvement in digital contact tracing can be obtained through the integration of probabilistic techniques for risk assessment that can more effectively guide the allocation of new diagnostic tests. In this study, we first quantitatively analyze the diagnostic and social costs associated with these containment measures based on contact tracing, employing three state-of-the-art models of SARS-CoV-2 spreading. Our results suggest that probabilistic techniques allow for more effective mitigation at a lower cost. Secondly, our findings reveal a remarkable efficacy of probabilistic contact-tracing techniques in performing backward and multi-step tracing and capturing super-spreading events.
♻ ☆ Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
The integration of artificial intelligence (AI) in medical diagnostics represents a significant advancement in managing upper gastrointestinal (GI) cancer, a major cause of global cancer mortality. Specifically for gastric cancer (GC), chronic inflammation causes changes in the mucosa such as atrophy, intestinal metaplasia (IM), dysplasia and ultimately cancer. Early detection through endoscopic regular surveillance is essential for better outcomes. Foundation models (FM), which are machine or deep learning models trained on diverse data and applicable to broad use cases, offer a promising solution to enhance the accuracy of endoscopy and its subsequent pathology image analysis. This review explores the recent advancements, applications, and challenges associated with FM in endoscopy and pathology imaging. We started by elucidating the core principles and architectures underlying these models, including their training methodologies and the pivotal role of large-scale data in developing their predictive capabilities. Moreover, this work discusses emerging trends and future research directions, emphasizing the integration of multimodal data, the development of more robust and equitable models, and the potential for real-time diagnostic support. This review aims to provide a roadmap for researchers and practitioners in navigating the complexities of incorporating FM into clinical practice for prevention/management of GC cases, thereby improving patient outcomes.
♻ ☆ A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization
In this paper we consider finding an approximate second-order stationary point (SOSP) of general nonconvex conic optimization that minimizes a twice differentiable function subject to nonlinear equality constraints and also a convex conic constraint. In particular, we propose a Newton-conjugate gradient (Newton-CG) based barrier-augmented Lagrangian method for finding an approximate SOSP of this problem. Under some mild assumptions, we show that our method enjoys a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-11/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-11/2}\min\{n,\epsilon^{-5/4}\})$ for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of general nonconvex conic optimization with high probability. Moreover, under a constraint qualification, these complexity bounds are improved to $\widetilde{\cal O}(\epsilon^{-7/2})$ and $\widetilde{\cal O}(\epsilon^{-7/2}\min\{n,\epsilon^{-3/4}\})$, respectively. To the best of our knowledge, this is the first study on the complexity of finding an approximate SOSP of general nonconvex conic optimization. Preliminary numerical results are presented to demonstrate superiority of the proposed method over first-order methods in terms of solution quality.
comment: To appear in Computational Optimization and Applications. arXiv admin note: text overlap with arXiv:2301.03139
♻ ☆ Invariant Causal Prediction with Local Models
We consider the task of identifying the causal parents of a target variable among a set of candidates from observational data. Our main assumption is that the candidate variables are observed in different environments which may, under certain assumptions, be regarded as interventions on the observed system. We assume a linear relationship between target and candidates, which can be different in each environment with the only restriction that the causal structure is invariant across environments. Within our proposed setting we provide sufficient conditions for identifiability of the causal parents and introduce a practical method called L-ICP ($\textbf{L}$ocalized $\textbf{I}$nvariant $\textbf{Ca}$usal $\textbf{P}$rediction), which is based on a hypothesis test for parent identification using a ratio of minimum and maximum statistics. We then show in a simplified setting that the statistical power of L-ICP converges exponentially fast in the sample size, and finally we analyze the behavior of L-ICP experimentally in more general settings.
♻ ☆ On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis
We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem of such linear functionals, and characterize the approximation rate and its relation with memory. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs, which further reveal the intricate interactions between memory and learning. A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on approximation and optimization: when there is long term memory in the target, it takes a large number of neurons to approximate it. Moreover, the training process will suffer from slow downs. In particular, both of these effects become exponentially more pronounced with memory - a phenomenon we call the "curse of memory". These analyses represent a basic step towards a concrete mathematical understanding of new phenomenon that may arise in learning temporal relationships using recurrent architectures.
comment: Updated to include the condition $\sup_n \| \boldsymbol{x}(n) \|_{\mathcal{X}} \leq 1$ in the definition of regularity, which excludes the trivial case where only the zero functional is regular. Fixed various typos and improved clarity
♻ ☆ Wasserstein multivariate auto-regressive models for modeling distributional time series
This paper is focused on the statistical analysis of data consisting of a collection of multiple series of probability measures that are indexed by distinct time instants and supported over a bounded interval of the real line. By modeling these time-dependent probability measures as random objects in the Wasserstein space, we propose a new auto-regressive model for the statistical analysis of multivariate distributional time series. Using the theory of iterated random function systems, results on the existence, uniqueness and stationarity of the solution of such a model are provided. We also propose a consistent estimator for the auto-regressive coefficients of this model. Due to the simplex constraints that we impose on the model coefficients, the proposed estimator that is learned under these constraints, naturally has a sparse structure. The sparsity allows the application of the proposed model in learning a graph of temporal dependency from multivariate distributional time series. We explore the numerical performances of our estimation procedure using simulated data. To shed some light on the benefits of our approach for real data analysis, we also apply this methodology to a data set made of observations from age distribution in different countries.
♻ ☆ Scalable Multi-Agent Reinforcement Learning for Warehouse Logistics with Robotic and Human Co-Workers IROS
We consider a warehouse in which dozens of mobile robots and human pickers work together to collect and deliver items within the warehouse. The fundamental problem we tackle, called the order-picking problem, is how these worker agents must coordinate their movement and actions in the warehouse to maximise performance in this task. Established industry methods using heuristic approaches require large engineering efforts to optimise for innately variable warehouse configurations. In contrast, multi-agent reinforcement learning (MARL) can be flexibly applied to diverse warehouse configurations (e.g. size, layout, number/types of workers, item replenishment frequency), and different types of order-picking paradigms (e.g. Goods-to-Person and Person-to-Goods), as the agents can learn how to cooperate optimally through experience. We develop hierarchical MARL algorithms in which a manager agent assigns goals to worker agents, and the policies of the manager and workers are co-trained toward maximising a global objective (e.g. pick rate). Our hierarchical algorithms achieve significant gains in sample efficiency over baseline MARL algorithms and overall pick rates over multiple established industry heuristics in a diverse set of warehouse configurations and different order-picking paradigms.
comment: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024
♻ ☆ A Deep-Learning Technique to Locate Cryptographic Operations in Side-Channel Traces DATE24
Side-channel attacks allow extracting secret information from the execution of cryptographic primitives by correlating the partially known computed data and the measured side-channel signal. However, to set up a successful side-channel attack, the attacker has to perform i) the challenging task of locating the time instant in which the target cryptographic primitive is executed inside a side-channel trace and then ii)the time-alignment of the measured data on that time instant. This paper presents a novel deep-learning technique to locate the time instant in which the target computed cryptographic operations are executed in the side-channel trace. In contrast to state-of-the-art solutions, the proposed methodology works even in the presence of trace deformations obtained through random delay insertion techniques. We validated our proposal through a successful attack against a variety of unprotected and protected cryptographic primitives that have been executed on an FPGA-implemented system-on-chip featuring a RISC-V CPU.
comment: 6 pages, 3 figures. Presented at DATE24
♻ ☆ Object-Centric Diffusion for Efficient Video Editing ECCV24
Diffusion-based video editing have reached impressive quality and can transform either the global style, local structure, and attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion, to fix generation artifacts and further reduce latency by allocating more computations towards foreground edited regions, arguably more important for perceptual quality. We achieve this by two novel proposals: i) Object-Centric Sampling, decoupling the diffusion steps spent on salient or background regions and spending most on the former, and ii) Object-Centric Token Merging, which reduces cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction up to 10x for a comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.
comment: ECCV24
♻ ☆ Fast Fishing: Approximating BAIT for Efficient and Scalable Deep Active Image Classification ECML
Deep active learning (AL) seeks to minimize the annotation costs for training deep neural networks. BAIT, a recently proposed AL strategy based on the Fisher Information, has demonstrated impressive performance across various datasets. However, BAIT's high computational and memory requirements hinder its applicability on large-scale classification tasks, resulting in current research neglecting BAIT in their evaluation. This paper introduces two methods to enhance BAIT's computational efficiency and scalability. Notably, we significantly reduce its time complexity by approximating the Fisher Information. In particular, we adapt the original formulation by i) taking the expectation over the most probable classes, and ii) constructing a binary classification task, leading to an alternative likelihood for gradient computations. Consequently, this allows the efficient use of BAIT on large-scale datasets, including ImageNet. Our unified and comprehensive evaluation across a variety of datasets demonstrates that our approximations achieve strong performance with considerably reduced time complexity. Furthermore, we provide an extensive open-source toolbox that implements recent state-of-the-art AL strategies, available at https://github.com/dhuseljic/dal-toolbox.
comment: Accepted at ECML PKDD 2024
♻ ☆ Incorporating Unlabelled Data into Bayesian Neural Networks
Conventional Bayesian Neural Networks (BNNs) are unable to leverage unlabelled data to improve their predictions. To overcome this limitation, we introduce Self-Supervised Bayesian Neural Networks, which use unlabelled data to learn models with suitable prior predictive distributions. This is achieved by leveraging contrastive pretraining techniques and optimising a variational lower bound. We then show that the prior predictive distributions of self-supervised BNNs capture problem semantics better than conventional BNN priors. In turn, our approach offers improved predictive performance over conventional BNNs, especially in low-budget regimes.
comment: Published in the Transactions on Machine Learning Research
♻ ☆ EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles CIKM 2024
This work introduces EUvsDisinfo, a multilingual dataset of disinformation articles originating from pro-Kremlin outlets, along with trustworthy articles from credible / less biased sources. It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project. Our dataset is the largest to-date resource in terms of the overall number of articles and distinct languages. It also provides the largest topical and temporal coverage. Using this dataset, we investigate the dissemination of pro-Kremlin disinformation across different languages, uncovering language-specific patterns targeting certain disinformation topics. We further analyse the evolution of topic distribution over an eight-year period, noting a significant surge in disinformation content before the full-scale invasion of Ukraine in 2022. Lastly, we demonstrate the dataset's applicability in training models to effectively distinguish between disinformation and trustworthy content in multilingual settings.
comment: Published at CIKM 2024
♻ ☆ Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.
♻ ☆ Robust Statistical Scaling of Outlier Scores: Improving the Quality of Outlier Probabilities for Outliers (Extended Version)
Outlier detection algorithms typically assign an outlier score to each observation in a dataset, indicating the degree to which an observation is an outlier. However, these scores are often not comparable across algorithms and can be difficult for humans to interpret. Statistical scaling addresses this problem by transforming outlier scores into outlier probabilities without using ground-truth labels, thereby improving interpretability and comparability across algorithms. However, the quality of this transformation can be different for outliers and inliers. Missing outliers in scenarios where they are of particular interest - such as healthcare, finance, or engineering - can be costly or dangerous. Thus, ensuring good probabilities for outliers is essential. This paper argues that statistical scaling, as commonly used in the literature, does not produce equally good probabilities for outliers as for inliers. Therefore, we propose robust statistical scaling, which uses robust estimators to improve the probabilities for outliers. We evaluate several variants of our method against other outlier score transformations for real-world datasets and outlier detection algorithms, where it can improve the probabilities for outliers.
comment: 15 pages, 4 figures, extended version of an original article accepted for publication in SISAP 2024 by Springer Nature
♻ ☆ Lamarr: LHCb ultra-fast simulation based on machine learning models deployed within Gauss
About 90% of the computing resources available to the LHCb experiment has been spent to produce simulated data samples for Run 2 of the Large Hadron Collider at CERN. The upgraded LHCb detector will be able to collect larger data samples, requiring many more simulated events to analyze the data to be collected in Run 3. Simulation is a key necessity of analysis to interpret signal, reject background and measure efficiencies. The needed simulation will far exceed the pledged resources, requiring an evolution in technologies and techniques to produce these simulated data samples. In this contribution, we discuss Lamarr, a Gaudi-based framework to speed-up the simulation production parameterizing both the detector response and the reconstruction algorithms of the LHCb experiment. Deep Generative Models powered by several algorithms and strategies are employed to effectively parameterize the high-level response of the single components of the LHCb detector, encoding within neural networks the experimental errors and uncertainties introduced in the detection and reconstruction phases. Where possible, models are trained directly on real data, statistically subtracting any background components by applying appropriate reweighing procedures. Embedding Lamarr in the general LHCb Gauss Simulation framework allows to combine its execution with any of the available generators in a seamless way. The resulting software package enables a simulation process independent of the detailed simulation used to date.
comment: To be published in Journal of Physics: Conference Series (ACAT 2022)
♻ ☆ Non-Homophilic Graph Pre-Training and Prompt Learning
Graphs are ubiquitous for modeling complex relationships between objects across various fields. Graph neural networks (GNNs) have become a mainstream technique for graph-based applications, but their performance heavily relies on abundant labeled data. To reduce labeling requirement, pre-training and prompt learning has become a popular alternative. However, most existing prompt methods do not differentiate homophilic and heterophilic characteristics of real-world graphs. In particular, many real-world graphs are non-homophilic, not strictly or uniformly homophilic with mixing homophilic and heterophilic patterns, exhibiting varying non-homophilic characteristics across graphs and nodes. In this paper, we propose ProNoG, a novel pre-training and prompt learning framework for such non-homophilic graphs. First, we analyze existing graph pre-training methods, providing theoretical insights into the choice of pre-training tasks. Second, recognizing that each node exhibits unique non-homophilic characteristics, we propose a conditional network to characterize the node-specific patterns in downstream tasks. Finally, we thoroughly evaluate and analyze ProNoG through extensive experiments on ten public datasets.
comment: Under review
♻ ☆ DiffLoad: Uncertainty Quantification in Electrical Load Forecasting with Diffusion Model
Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where and to what extent the uncertainty is, thereby enhancing their confidence in the following decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also demonstrates the ability to separate the two types of uncertainties and be applicable to different levels of loads. The relevant code can be found at \url{https://anonymous.4open.science/r/DiffLoad-4714/}.
comment: Accepted by IEEE Transactions on Power Systems, 2024
♻ ☆ Hyperparameter Optimization as a Service on INFN Cloud
The simplest and often most effective way of parallelizing the training of complex machine learning models is to execute several training instances on multiple machines, scanning the hyperparameter space to optimize the underlying statistical model and the learning procedure. Often, such a meta-learning procedure is limited by the ability of accessing securely a common database organizing the knowledge of the previous and ongoing trials. Exploiting opportunistic GPUs provided in different environments represents a further challenge when designing such optimization campaigns. In this contribution, we discuss how a set of REST APIs can be used to access a dedicated service based on INFN Cloud to monitor and coordinate multiple training instances, with gradient-less optimization techniques, via simple HTTP requests. The service, called Hopaas (Hyperparameter OPtimization As A Service), is made of a web interface and sets of APIs implemented with a FastAPI backend running through Uvicorn and NGINX in a virtual instance of INFN Cloud. The optimization algorithms are currently based on Bayesian techniques as provided by Optuna. A Python frontend is also made available for quick prototyping. We present applications to hyperparameter optimization campaigns performed by combining private, INFN Cloud, and CINECA resources. Such multi-node multi-site optimization studies have given a significant boost to the development of a set of parameterizations for the ultra-fast simulation of the LHCb experiment.
comment: To be published in Journal of Physics: Conference Series (ACAT 2022)
♻ ☆ On the Causal Sufficiency and Necessity of Multi-Modal Representation Learning
An effective paradigm of multi-modal learning (MML) is to learn unified representations among modalities. From a causal perspective, constraining the consistency between different modalities can mine causal representations that convey primary events. However, such simple consistency may face the risk of learning insufficient or unnecessary information: a necessary but insufficient cause is invariant across modalities but may not have the required accuracy; a sufficient but unnecessary cause tends to adapt well to specific modalities but may be hard to adapt to new data. To address this issue, in this paper, we aim to learn representations that are both causal sufficient and necessary, i.e., Causal Complete Cause ($C^3$), for MML. Firstly, we define the concept of $C^3$ for MML, which reflects the probability of being causal sufficiency and necessity. We also propose the identifiability and measurement of $C^3$, i.e., $C^3$ risk, to ensure calculating the learned representations' $C^3$ scores in practice. Then, we theoretically prove the effectiveness of $C^3$ risk by establishing the performance guarantee of MML with a tight generalization bound. Based on these theoretical results, we propose a plug-and-play method, namely Causal Complete Cause Regularization ($C^3$R), to learn causal complete representations by constraining the $C^3$ risk bound. Extensive experiments conducted on various benchmark datasets empirically demonstrate the effectiveness of $C^3$R.
♻ ☆ Empowering Aggregators with Practical Data-Driven Tools: Harnessing Aggregated and Disaggregated Flexibility for Demand Response
This study explores the interaction between aggregators and building occupants in activating flexibility through Demand Response (DR) programs, with a focus on reinforcing the resilience of the energy system considering the uncertainties presented by Renewable Energy Sources (RES). Firstly, it introduces a methodology of optimizing aggregated flexibility provision strategies in environments with limited data, utilizing Discrete Fourier Transformation (DFT) and clustering techniques to identify building occupants' activity patterns. Secondly, the study assesses the disaggregated flexibility provision of Heating Ventilation and Air Conditioning (HVAC) systems during DR events, employing machine learning and optimization techniques for precise, device-level analysis. The first approach offers a non-intrusive pathway for aggregators to provide flexibility services in environments of a single smart meter for the whole building's consumption, while the second approach maximizes the amount of flexibility in the case of dedicated metering devices to the HVAC systems by carefully considering building occupants' thermal comfort profiles. Through the application of data-driven techniques and encompassing case studies from both industrial and residential buildings, this paper not only unveils pivotal opportunities for aggregators in the balancing and emerging flexibility markets but also successfully develops and demonstrates end-to-end practical tools for aggregators.
♻ ☆ Markov flow policy -- deep MC
Discounted algorithms often encounter evaluation errors due to their reliance on short-term estimations, which can impede their efficacy in addressing simple, short-term tasks and impose undesired temporal discounts (\(\gamma\)). Interestingly, these algorithms are often tested without applying a discount, a phenomenon we refer as the \textit{train-test bias}. In response to these challenges, we propose the Markov Flow Policy, which utilizes a non-negative neural network flow to enable comprehensive forward-view predictions. Through integration into the TD7 codebase and evaluation using the MuJoCo benchmark, we observe significant performance improvements, positioning MFP as a straightforward, practical, and easily implementable solution within the domain of average rewards algorithms.
comment: Paper has been not finished
♻ ☆ Solving Collaborative Dec-POMDPs with Deep Reinforcement Learning Heuristics
WQMIX, QMIX, QTRAN, and VDN are SOTA algorithms for Dec-POMDP. All of them cannot solve complex agents' cooperation domains. We give an algorithm to solve such problems. In the first stage, we solve a single-agent problem and get a policy. In the second stage, we solve the multi-agent problem with the single-agent policy. SA2MA has a clear advantage over all competitors in complex agents' cooperative domains.
comment: Paper has been not finished
♻ ☆ FedAgg: Adaptive Federated Learning with Aggregated Gradients
Federated Learning (FL) has emerged as a crucial distributed training paradigm, enabling discrete devices to collaboratively train a shared model under the coordination of a central server, while leveraging their locally stored private data. Nonetheless, the non-independent-and-identically-distributed (Non-IID) data generated on heterogeneous clients and the incessant information exchange among participants may significantly impede training efficacy, retard the model convergence rate and increase the risk of privacy leakage. To alleviate the divergence between the local and average model parameters and obtain a fast model convergence rate, we propose an adaptive FEDerated learning algorithm called FedAgg by refining the conventional stochastic gradient descent (SGD) methodology with an AGgregated Gradient term at each local training epoch and adaptively adjusting the learning rate based on a penalty term that quantifies the local model deviation. To tackle the challenge of information exchange among clients during local training and design a decentralized adaptive learning rate for each client, we introduce two mean-field terms to approximate the average local parameters and gradients over time. Through rigorous theoretical analysis, we demonstrate the existence and convergence of the mean-field terms and provide a robust upper bound on the convergence of our proposed algorithm. The extensive experimental results on real-world datasets substantiate the superiority of our framework in comparison with existing state-of-the-art FL strategies for enhancing model performance and accelerating convergence rate under IID and Non-IID datasets.
♻ ☆ Training Neural Networks on Data Sources with Unknown Reliability
When data is generated by multiple sources, conventional training methods update models assuming equal reliability for each source and do not consider their individual data quality during training. However, in many applications, sources have varied levels of reliability that can have negative effects on the performance of a neural network. A key issue is that often the quality of data for individual sources is not known during training. Focusing on supervised learning, this work presents a solution that aims to train neural networks on each data source for a number of steps proportional to the source's estimated relative reliability. This way, we allow training on all sources during the warm-up, and reduce learning on less reliable sources during the final training stages, when it has been shown models overfit to noise. We show through diverse experiments, this can significantly improve model performance when trained on mixtures of reliable and unreliable data sources, and maintain performance when models are trained on reliable sources only.
♻ ☆ Mending of Spatio-Temporal Dependencies in Block Adjacency Matrix ICONIP 2024
In the realm of applications where data dynamically evolves across spatial and temporal dimensions, Graph Neural Networks (GNNs) are often complemented by sequence modeling architectures, such as RNNs and transformers, to effectively model temporal changes. These hybrid models typically arrange the spatial and temporal learning components in series. A pioneering effort to jointly model the spatio-temporal dependencies using only GNNs was the introduction of the Block Adjacency Matrix \(\mathbf{A_B}\) \cite{1}, which was constructed by diagonally concatenating adjacency matrices from graphs at different time steps. This approach resulted in a single graph encompassing complete spatio-temporal data; however, the graphs from different time steps remained disconnected, limiting GNN message-passing to spatially connected nodes only. Addressing this critical challenge, we propose a novel end-to-end learning architecture specifically designed to mend the temporal dependencies, resulting in a well-connected graph. Thus, we provide a framework for the learnable representation of spatio-temporal data as graphs. Our methodology demonstrates superior performance on benchmark datasets, such as SurgVisDom and C2D2, surpassing existing state-of-the-art graph models in terms of accuracy. Our model also achieves significantly lower computational complexity, having far fewer parameters than methods reliant on CLIP and 3D CNN architectures.
comment: Accepted at ICONIP 2024
♻ ☆ Enhancing Weather Predictions: Super-Resolution via Deep Diffusion Models
This study investigates the application of deep-learning diffusion models for the super-resolution of weather data, a novel approach aimed at enhancing the spatial resolution and detail of meteorological variables. Leveraging the capabilities of diffusion models, specifically the SR3 and ResDiff architectures, we present a methodology for transforming low-resolution weather data into high-resolution outputs. Our experiments, conducted using the WeatherBench dataset, focus on the super-resolution of the two-meter temperature variable, demonstrating the models' ability to generate detailed and accurate weather maps. The results indicate that the ResDiff model, further improved by incorporating physics-based modifications, significantly outperforms traditional SR3 methods in terms of Mean Squared Error (MSE), Structural Similarity Index (SSIM), and Peak Signal-to-Noise Ratio (PSNR). This research highlights the potential of diffusion models in meteorological applications, offering insights into their effectiveness, challenges, and prospects for future advancements in weather prediction and climate analysis.
♻ ☆ On Robust Reinforcement Learning with Lipschitz-Bounded Policy Networks
This paper presents a study of robust policy networks in deep reinforcement learning. We investigate the benefits of policy parameterizations that naturally satisfy constraints on their Lipschitz bound, analyzing their empirical performance and robustness on two representative problems: pendulum swing-up and Atari Pong. We illustrate that policy networks with smaller Lipschitz bounds are more robust to disturbances, random noise, and targeted adversarial attacks than unconstrained policies composed of vanilla multi-layer perceptrons or convolutional neural networks. However, the structure of the Lipschitz layer is important. We find that the widely-used method of spectral normalization is too conservative and severely impacts clean performance, whereas more expressive Lipschitz layers such as the recently-proposed Sandwich layer can achieve improved robustness without sacrificing clean performance.
♻ ☆ Generative Design of Crystal Structures by Point Cloud Representations and Diffusion Model
Efficiently generating energetically stable crystal structures has long been a challenge in material design, primarily due to the immense arrangement of atoms in a crystal lattice. To facilitate the discovery of stable material, we present a framework for the generation of synthesizable materials, leveraging a point cloud representation to encode intricate structural information. At the heart of this framework lies the introduction of a diffusion model as its foundational pillar. To gauge the efficacy of our approach, we employ it to reconstruct input structures from our training datasets, rigorously validating its high reconstruction performance. Furthermore, we demonstrate the profound potential of Point Cloud-Based Crystal Diffusion (PCCD) by generating entirely new materials, emphasizing their synthesizability. Our research stands as a noteworthy contribution to the advancement of materials design and synthesis through the cutting-edge avenue of generative design instead of the conventional substitution or experience-based discovery.
comment: I have submitted to a journal
♻ ☆ SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks.cIn this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
♻ ☆ Anomaly Detection in Time Series of EDFA Pump Currents to Monitor Degeneration Processes using Fuzzy Clustering
This article proposes a novel fuzzy clustering based anomaly detection method for pump current time series of EDFA systems. The proposed change detection framework (CDF) strategically combines the advantages of entropy analysis (EA) and principle component analysis (PCA) with fuzzy clustering procedures. In the framework, EA is applied for dynamic selection of features for reduction of the feature space and increase of computational performance. Furthermore, PCA is utilized to extract features from the raw feature space to enable generalization capability of the subsequent fuzzy clustering procedures. Three different fuzzy clustering methods, more precisely the fuzzy clustering algorithm, a probabilistic clustering algorithm and a possibilistic clustering algorithm are evaluated for performance and generalization. Hence, the proposed framework has the innovative feature to detect changes in pump current time series at an early stage for arbitrary points of operation, compared to state-of-the-art predefined alarms in commercially used EDFAs. Moreover, the approach is implemented and tested using experimental data. In addition, the proposed framework enables further approaches of applying decentralized predictive maintenance for optical fiber networks.
comment: 6 pages, 6 figures
♻ ☆ Towards Learning Abductive Reasoning using VSA Distributed Representations
We introduce the Abductive Rule Learner with Context-awareness (ARLC), a model that solves abstract reasoning tasks based on Learn-VRF. ARLC features a novel and more broadly applicable training objective for abductive reasoning, resulting in better interpretability and higher accuracy when solving Raven's progressive matrices (RPM). ARLC allows both programming domain knowledge and learning the rules underlying a data distribution. We evaluate ARLC on the I-RAVEN dataset, showcasing state-of-the-art accuracy across both in-distribution and out-of-distribution (unseen attribute-rule pairs) tests. ARLC surpasses neuro-symbolic and connectionist baselines, including large language models, despite having orders of magnitude fewer parameters. We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge, which only improves its performance and does not result in catastrophic forgetting of the programmed solution. We validate ARLC's seamless transfer learning from a 2x2 RPM constellation to unseen constellations. Our code is available at https://github.com/IBM/abductive-rule-learner-with-context-awareness.
comment: Accepted at the 18th International Conference on Neural-Symbolic Learning and Reasoning (NeSy) 2024 [Spotlight]
♻ ☆ Tailoring Adversarial Attacks on Deep Neural Networks for Targeted Class Manipulation Using DeepFool Algorithm
The susceptibility of deep neural networks (DNNs) to adversarial attacks undermines their reliability across numerous applications, underscoring the necessity for an in-depth exploration of these vulnerabilities and the formulation of robust defense strategies. The DeepFool algorithm by Moosavi-Dezfooli et al. (2016) represents a pivotal step in identifying minimal perturbations required to induce misclassification of input images. Nonetheless, its generic methodology falls short in scenarios necessitating targeted interventions. Additionally, previous research studies have predominantly concentrated on the success rate of attacks without adequately addressing the consequential distortion of images, the maintenance of image quality, or the confidence threshold required for misclassification. To bridge these gaps, we introduce the Enhanced Targeted DeepFool (ET DeepFool) algorithm, an evolution of DeepFool that not only facilitates the specification of desired misclassification targets but also incorporates a configurable minimum confidence score. Our empirical investigations demonstrate the superiority of this refined approach in maintaining the integrity of images and minimizing perturbations across a variety of DNN architectures. Unlike previous iterations, such as the Targeted DeepFool by Gajjar et al. (2022), our method grants unparalleled control over the perturbation process, enabling precise manipulation of model responses. Preliminary outcomes reveal that certain models, including AlexNet and the advanced Vision Transformer, display commendable robustness to such manipulations. This discovery of varying levels of model robustness, as unveiled through our confidence level adjustments, could have far-reaching implications for the field of image recognition. Our code will be made public upon acceptance of the paper.
comment: 18 pages, 5 figures
♻ ☆ Transformers are Expressive, But Are They Expressive Enough for Regression?
Transformers have become pivotal in Natural Language Processing, demonstrating remarkable success in applications like Machine Translation and Summarization. Given their widespread adoption, several works have attempted to analyze the expressivity of Transformers. Expressivity of a neural network is the class of functions it can approximate. A neural network is fully expressive if it can act as a universal function approximator. We attempt to analyze the same for Transformers. Contrary to existing claims, our findings reveal that Transformers struggle to reliably approximate smooth functions, relying on piecewise constant approximations with sizable intervals. The central question emerges as: ''Are Transformers truly Universal Function Approximators?'' To address this, we conduct a thorough investigation, providing theoretical insights and supporting evidence through experiments. Theoretically, we prove that Transformer Encoders cannot approximate smooth functions. Experimentally, we complement our theory and show that the full Transformer architecture cannot approximate smooth functions. By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities. Code Link: https://github.com/swaroop-nath/transformer-expressivity.
comment: 18 pages, 17 figures, 6 tables
♻ ☆ Perceptual Similarity for Measuring Decision-Making Style and Policy Diversity in Games
Defining and measuring decision-making styles, also known as playstyles, is crucial in gaming, where these styles reflect a broad spectrum of individuality and diversity. However, finding a universally applicable measure for these styles poses a challenge. Building on Playstyle Distance, the first unsupervised metric to measure playstyle similarity based on game screens and raw actions, we introduce three enhancements to increase accuracy: multiscale analysis with varied state granularity, a perceptual kernel rooted in psychology, and the utilization of the intersection-over-union method for efficient evaluation. These innovations not only advance measurement precision but also offer insights into human cognition of similarity. Across two racing games and seven Atari games, our techniques significantly improve the precision of zero-shot playstyle classification, achieving an accuracy exceeding 90 percent with fewer than 512 observation-action pairs, which is less than half an episode of these games. Furthermore, our experiments with 2048 and Go demonstrate the potential of discrete playstyle measures in puzzle and board games. We also develop an algorithm for assessing decision-making diversity using these measures. Our findings improve the measurement of end-to-end game analysis and the evolution of artificial intelligence for diverse playstyles.
comment: TMLR 08/2024 https://openreview.net/forum?id=30C9AWBW49
♻ ☆ Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model's language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method "Any Model Can Talk". We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.
comment: Technical report, work in progress. Demo and code: https://github.com/gpt-omni/mini-omni
♻ ☆ Graph Neural Networks in EEG-based Emotion Recognition: A Survey
Compared to other modalities, EEG-based emotion recognition can intuitively respond to the emotional patterns in the human brain and, therefore, has become one of the most concerning tasks in the brain-computer interfaces field. Since dependencies within brain regions are closely related to emotion, a significant trend is to develop Graph Neural Networks (GNNs) for EEG-based emotion recognition. However, brain region dependencies in emotional EEG have physiological bases that distinguish GNNs in this field from those in other time series fields. Besides, there is neither a comprehensive review nor guidance for constructing GNNs in EEG-based emotion recognition. In the survey, our categorization reveals the commonalities and differences of existing approaches under a unified framework of graph construction. We analyze and categorize methods from three stages in the framework to provide clear guidance on constructing GNNs in EEG-based emotion recognition. In addition, we discuss several open challenges and future directions, such as Temporal full-connected graph and Graph condensation.
♻ ☆ Advancing Chinese biomedical text mining with community challenges
Objective: This study aims to review the recent advances in community challenges for biomedical text mining in China. Methods: We collected information of evaluation tasks released in community challenges of biomedical text mining, including task description, dataset description, data source, task type and related links. A systematic summary and comparative analysis were conducted on various biomedical natural language processing tasks, such as named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. Results: We identified 39 evaluation tasks from 6 community challenges that spanned from 2017 to 2023. Our analysis revealed the diverse range of evaluation task types and data sources in biomedical text mining. We explored the potential clinical applications of these community challenge tasks from a translational biomedical informatics perspective. We compared with their English counterparts, and discussed the contributions, limitations, lessons and guidelines of these community challenges, while highlighting future directions in the era of large language models. Conclusion: Community challenge evaluation competitions have played a crucial role in promoting technology innovation and fostering interdisciplinary collaboration in the field of biomedical text mining. These challenges provide valuable platforms for researchers to develop state-of-the-art solutions.
♻ ☆ Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds
Learning a transition model via Maximum Likelihood Estimation (MLE) followed by planning inside the learned model is perhaps the most standard and simplest Model-based Reinforcement Learning (RL) framework. In this work, we show that such a simple Model-based RL scheme, when equipped with optimistic and pessimistic planning procedures, achieves strong regret and sample complexity bounds in online and offline RL settings. Particularly, we demonstrate that under the conditions where the trajectory-wise reward is normalized between zero and one and the transition is time-homogenous, it achieves horizon-free and second-order bounds. Horizon-free means that our bounds have no polynomial dependence on the horizon of the Markov Decision Process. A second-order bound is a type of instance-dependent bound that scales with respect to the variances of the returns of the policies which can be small when the system is nearly deterministic and (or) the optimal policy has small values. We highlight that our algorithms are simple, fairly standard, and indeed have been extensively studied in the RL literature: they learn a model via MLE, build a version space around the MLE solution, and perform optimistic or pessimistic planning depending on whether operating in the online or offline mode. These algorithms do not rely on additional specialized algorithmic designs such as learning variances and performing variance-weighted learning and thus can leverage rich function approximations that are significantly beyond linear or tabular structures. The simplicity of the algorithms also implies that our horizon-free and second-order regret analysis is actually standard and mainly follows the general framework of optimism/pessimism in the face of uncertainty.
♻ ☆ Towards Graph Prompt Learning: A Survey and Beyond
Large-scale "pre-train and prompt learning" paradigms have demonstrated remarkable adaptability, enabling broad applications across diverse domains such as question answering, image recognition, and multimodal retrieval. This approach fully leverages the potential of large-scale pre-trained models, reducing downstream data requirements and computational costs while enhancing model applicability across various tasks. Graphs, as versatile data structures that capture relationships between entities, play pivotal roles in fields such as social network analysis, recommender systems, and biological graphs. Despite the success of pre-train and prompt learning paradigms in Natural Language Processing (NLP) and Computer Vision (CV), their application in graph domains remains nascent. In graph-structured data, not only do the node and edge features often have disparate distributions, but the topological structures also differ significantly. This diversity in graph data can lead to incompatible patterns or gaps between pre-training and fine-tuning on downstream graphs. We aim to bridge this gap by summarizing methods for alleviating these disparities. This includes exploring prompt design methodologies, comparing related techniques, assessing application scenarios and datasets, and identifying unresolved problems and challenges. This survey categorizes over 100 relevant works in this field, summarizing general design principles and the latest applications, including text-attributed graphs, molecules, proteins, and recommendation systems. Through this extensive review, we provide a foundational understanding of graph prompt learning, aiming to impact not only the graph mining community but also the broader Artificial General Intelligence (AGI) community.
comment: 19 pages, 2 figures
♻ ☆ Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
Serving large language models (LLMs) in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (eg. TTFT, TBT, Normalised Latency and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance crucial for real-time applications such as chat and translation. In this paper, we first identify the pitfalls of current performance metrics in evaluating LLM inference systems. We then propose Etalon, a comprehensive performance evaluation framework that includes fluidity-index -- a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience. Finally, we evaluate various existing open-source platforms and model-as-a-service offerings using Etalon, discussing their strengths and weaknesses. Etalon is available at https://github.com/project-etalon/etalon.
♻ ☆ WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models
Compiler correctness is crucial, as miscompilation can falsify program behaviors, leading to serious consequences. Fuzzing has been studied to uncover compiler defects. However, compiler fuzzing remains challenging: Existing arts focus on black- and grey-box fuzzing, which generates tests without sufficient understanding of internal compiler behaviors. Meanwhile, traditional white-box techniques, like symbolic execution, are computationally inapplicable to the giant codebase of compilers. Recent advances demonstrate that Large Language Models (LLMs) excel in code generation/understanding tasks. Nonetheless, guiding LLMs with compiler source-code information remains a missing piece of research in compiler testing. To this end, we propose WhiteFox, the first white-box compiler fuzzer using LLMs with source-code information to test compiler optimization, with a spotlight on detecting deep logic bugs in the deep learning (DL) compilers. WhiteFox adopts a multi-agent framework: an LLM-based analysis agent examines the low-level optimization source code and produces requirements on the high-level test programs that can trigger the optimization; an LLM-based generation agent produces test programs based on the summarized requirements. Additionally, optimization-triggering tests are used as feedback to enhance the generation on the fly. Our evaluation on the three most popular DL compilers (i.e., PyTorch Inductor, TensorFlow-XLA, and TensorFlow Lite) shows WhiteFox can generate high-quality test programs to exercise deep optimizations, practicing up to 8X more than state-of-the-art fuzzers. WhiteFox has found 101 bugs for the DL compilers, with 92 confirmed as previously unknown and 70 fixed. WhiteFox has been acknowledged by the PyTorch team and is being incorporated into its development workflow. Beyond DL compilers, WhiteFox can also be adapted for compilers in different domains.
comment: Published in OOPSLA 2024
♻ ☆ EEGMatch: Learning with Incomplete Labels for Semi-Supervised EEG-based Cross-Subject Emotion Recognition
Electroencephalography (EEG) is an objective tool for emotion recognition and shows promising performance. However, the label scarcity problem is a main challenge in this field, which limits the wide application of EEG-based emotion recognition. In this paper, we propose a novel semi-supervised learning framework (EEGMatch) to leverage both labeled and unlabeled EEG data. First, an EEG-Mixup based data augmentation method is developed to generate more valid samples for model learning. Second, a semi-supervised two-step pairwise learning method is proposed to bridge prototype-wise and instance-wise pairwise learning, where the prototype-wise pairwise learning measures the global relationship between EEG data and the prototypical representation of each emotion class and the instance-wise pairwise learning captures the local intrinsic relationship among EEG data. Third, a semi-supervised multi-domain adaptation is introduced to align the data representation among multiple domains (labeled source domain, unlabeled source domain, and target domain), where the distribution mismatch is alleviated. Extensive experiments are conducted on two benchmark databases (SEED and SEED-IV) under a cross-subject leave-one-subject-out cross-validation evaluation protocol. The results show the proposed EEGmatch performs better than the state-of-the-art methods under different incomplete label conditions (with 6.89% improvement on SEED and 1.44% improvement on SEED-IV), which demonstrates the effectiveness of the proposed EEGMatch in dealing with the label scarcity problem in emotion recognition using EEG signals. The source code is available at https://github.com/KAZABANA/EEGMatch.
♻ ☆ High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise
Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds with the dependence on the confidence level that is either negative-power or logarithmic but under an additional assumption of sub-Gaussian (light-tailed) noise distribution that may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with H\"older-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.
comment: 61 pages, 12 figures. Changes in V2: different presentation of the results, different structure, new experiments. Changes in V3: some typos were fixed
♻ ☆ Dense-Sparse Deep Convolutional Neural Networks Training for Image Denoising
Recently, deep learning methods such as the convolutional neural networks have gained prominence in the area of image denoising. This is owing to their proven ability to surpass state-of-the-art classical image denoising algorithms such as block-matching and 3D filtering algorithm. Deep denoising convolutional neural networks use many feed-forward convolution layers with added regularization methods of batch normalization and residual learning to speed up training and improve denoising performance significantly. However, this comes at the expense of a huge number of trainable parameters. In this paper, we show that by employing an enhanced dense-sparse-dense network training procedure to the deep denoising convolutional neural networks, comparable denoising performance level can be achieved at a significantly reduced number of trainable parameters. We derive motivation from the fact that networks trained using the dense-sparse-dense approach have been shown to attain performance boost with reduced number of parameters. The proposed reduced deep denoising convolutional neural networks network is an efficient denoising model with significantly reduced parameters and comparable performance to the deep denoising convolutional neural networks. Additionally, denoising was achieved at significantly reduced processing time.
♻ ☆ On The Fairness Impacts of Hardware Selection in Machine Learning
In the machine learning ecosystem, hardware selection is often regarded as a mere utility, overshadowed by the spotlight on algorithms and data. This oversight is particularly problematic in contexts like ML-as-a-service platforms, where users often lack control over the hardware used for model deployment. How does the choice of hardware impact generalization properties? This paper investigates the influence of hardware on the delicate balance between model performance and fairness. We demonstrate that hardware choices can exacerbate existing disparities, attributing these discrepancies to variations in gradient flows and loss surfaces across different demographic groups. Through both theoretical and empirical analysis, the paper not only identifies the underlying factors but also proposes an effective strategy for mitigating hardware-induced performance imbalances.
♻ ☆ FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis. We do this for a subset of flaky tests where the root cause of flakiness is in the test itself and not in the production code. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, we augment the prompts of GPT-3.5 Turbo, a Large Language Model (LLM), with such extra knowledge to request repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT-3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs (roughly between 51% and 83%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.
comment: 26 pages, 20 Figures
♻ ☆ Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings INTERSPEECH 2024
We release Rasa, the first multilingual expressive TTS dataset for any Indian language, which contains 10 hours of neutral speech and 1-3 hours of expressive speech for each of the 6 Ekman emotions covering 3 languages: Assamese, Bengali, & Tamil. Our ablation studies reveal that just 1 hour of neutral and 30 minutes of expressive data can yield a Fair system as indicated by MUSHRA scores. Increasing neutral data to 10 hours, with minimal expressive data, significantly enhances expressiveness. This offers a practical recipe for resource-constrained languages, prioritizing easily obtainable neutral data alongside smaller amounts of expressive data. We show the importance of syllabically balanced data and pooling emotions to enhance expressiveness. We also highlight challenges in generating specific emotions, e.g., fear and surprise.
comment: Accepted at INTERSPEECH 2024. First two authors listed contributed equally
♻ ☆ Nonsmooth Projection-Free Optimization with Functional Constraints
This paper presents a subgradient-based algorithm for constrained nonsmooth convex optimization that does not require projections onto the feasible set. While the well-established Frank-Wolfe algorithm and its variants already avoid projections, they are primarily designed for smooth objective functions. In contrast, our proposed algorithm can handle nonsmooth problems with general convex functional inequality constraints. It achieves an $\epsilon$-suboptimal solution in $\mathcal{O}(\epsilon^{-2})$ iterations, with each iteration requiring only a single (potentially inexact) Linear Minimization Oracle (LMO) call and a (possibly inexact) subgradient computation. This performance is consistent with existing lower bounds. Similar performance is observed when deterministic subgradients are replaced with stochastic subgradients. In the special case where there are no functional inequality constraints, our algorithm competes favorably with a recent nonsmooth projection-free method designed for constraint-free problems. Our approach utilizes a simple separation scheme in conjunction with a new Lagrange multiplier update rule.
♻ ☆ Causality for Earth Science -- A Review on Time-series and Spatiotemporal Causality Methods
This survey paper covers the breadth and depth of time-series and spatiotemporal causality methods, and their applications in Earth Science. More specifically, the paper presents an overview of causal discovery and causal inference, explains the underlying causal assumptions, and enlists evaluation techniques and key terminologies of the domain area. The paper elicits the various state-of-the-art methods introduced for time-series and spatiotemporal causal analysis along with their strengths and limitations. The paper further describes the existing applications of several methods for answering specific Earth Science questions such as extreme weather events, sea level rise, teleconnections etc. This survey paper can serve as a primer for Data Science researchers interested in data-driven causal study as we share a list of resources, such as Earth Science datasets (synthetic, simulated and observational data) and open source tools for causal analysis. It will equally benefit the Earth Science community interested in taking an AI-driven approach to study the causality of different dynamic and thermodynamic processes as we present the open challenges and opportunities in performing causality-based Earth Science study.
♻ ☆ PSO Fuzzy XGBoost Classifier Boosted with Neural Gas Features on EEG Signals in Emotion Recognition
Emotion recognition is the technology-driven process of identifying and categorizing human emotions from various data sources, such as facial expressions, voice patterns, body motion, and physiological signals, such as EEG. These physiological indicators, though rich in data, present challenges due to their complexity and variability, necessitating sophisticated feature selection and extraction methods. NGN, an unsupervised learning algorithm, effectively adapts to input spaces without predefined grid structures, improving feature extraction from physiological data. Furthermore, the incorporation of fuzzy logic enables the handling of fuzzy data by introducing reasoning that mimics human decision-making. The combination of PSO with XGBoost aids in optimizing model performance through efficient hyperparameter tuning and decision process optimization. This study explores the integration of Neural-Gas Network (NGN), XGBoost, Particle Swarm Optimization (PSO), and fuzzy logic to enhance emotion recognition using physiological signals. Our research addresses three critical questions concerning the improvement of XGBoost with PSO and fuzzy logic, NGN's effectiveness in feature selection, and the performance comparison of the PSO-fuzzy XGBoost classifier with standard benchmarks. Acquired results indicate that our methodologies enhance the accuracy of emotion recognition systems and outperform other feature selection techniques using the majority of classifiers, offering significant implications for both theoretical advancement and practical application in emotion recognition technology.
comment: PSO, Fuzzy, XGBoost, Neural Gas Network (NGN), Feature Selection, EEG Signals, Emotion Recognition
♻ ☆ Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms MICRO'24
Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize the resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present Neu10, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. Neu10 consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement Neu10 based on a production-level NPU simulator. Our experiments show that Neu10 improves the throughput of ML inference services by up to 1.4$\times$ and reduces the tail latency by up to 4.6$\times$, while improving the NPU utilization by 1.2$\times$ on average, compared to state-of-the-art NPU sharing approaches.
comment: Accepted to MICRO'24
♻ ☆ LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
Multimodal Large Language Models (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and be able to run efficiently on small compute footprints accessible by most. Part of this quest, we introduce LLaVaOLMoBitnet1B - the first Ternary Multimodal LLM capable of accepting Image(s)+Text inputs to produce coherent textual responses. The model is fully open-sourced along with training scripts to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models and future opportunities. Link to the model: https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B
♻ ☆ Scaling Laws for Data Poisoning in LLMs
Recent work shows that LLMs are vulnerable to data poisoning, in which they are trained on partially corrupted or harmful data. Poisoned data is hard to detect, breaks guardrails, and leads to undesirable and harmful behavior. Given the intense efforts by leading labs to train and deploy increasingly larger and more capable LLMs, it is critical to ask if the risk of data poisoning will be naturally mitigated by scale, or if it is an increasing threat. We consider three threat models by which data poisoning can occur: malicious fine-tuning, imperfect data curation, and intentional data contamination. Our experiments evaluate the effects of data poisoning on 23 frontier LLMs ranging from 1.5-72 billion parameters on three datasets which speak to each of our threat models. We find that larger LLMs are increasingly vulnerable, learning harmful behavior significantly more quickly than smaller LLMs with even minimal data poisoning. These results underscore the need for robust safeguards against data poisoning in larger LLMs.
♻ ☆ Criticality Leveraged Adversarial Training (CLAT) for Boosted Performance via Parameter Efficiency
Adversarial training enhances neural network robustness but suffers from a tendency to overfit and increased generalization errors on clean data. This work introduces CLAT, an innovative approach that mitigates adversarial overfitting by introducing parameter efficiency into the adversarial training process, improving both clean accuracy and adversarial robustness. Instead of tuning the entire model, CLAT identifies and fine-tunes robustness-critical layers - those predominantly learning non-robust features - while freezing the remaining model to enhance robustness. It employs dynamic critical layer selection to adapt to changes in layer criticality throughout the fine-tuning process. Empirically, CLAT can be applied on top of existing adversarial training methods, significantly reduces the number of trainable parameters by approximately 95%, and achieves more than a 2% improvement in adversarial robustness compared to baseline methods.
comment: 9 pages + appendix/ additional experiments
♻ ☆ Mix-Domain Contrastive Learning for Unpaired H&E-to-IHC Stain Translation
H&E-to-IHC stain translation techniques offer a promising solution for precise cancer diagnosis, especially in low-resource regions where there is a shortage of health professionals and limited access to expensive equipment. Considering the pixel-level misalignment of H&E-IHC image pairs, current research explores the pathological consistency between patches from the same positions of the image pair. However, most of them overemphasize the correspondence between domains or patches, overlooking the side information provided by the non-corresponding objects. In this paper, we propose a Mix-Domain Contrastive Learning (MDCL) method to leverage the supervision information in unpaired H&E-to-IHC stain translation. Specifically, the proposed MDCL method aggregates the inter-domain and intra-domain pathology information by estimating the correlation between the anchor patch and all the patches from the matching images, encouraging the network to learn additional contrastive knowledge from mixed domains. With the mix-domain pathology information aggregation, MDCL enhances the pathological consistency between the corresponding patches and the component discrepancy of the patches from the different positions of the generated IHC image. Extensive experiments on two H&E-to-IHC stain translation datasets, namely MIST and BCI, demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics.
♻ ☆ FRACTAL: An Ultra-Large-Scale Aerial Lidar Dataset for 3D Semantic Segmentation of Diverse Landscapes
Mapping agencies are increasingly adopting Aerial Lidar Scanning (ALS) as a new tool to map buildings and other above-ground structures. Processing ALS data at scale requires efficient point classification methods that perform well over highly diverse territories. Large annotated Lidar datasets are needed to evaluate these classification methods, however, current Lidar benchmarks have restricted scope and often cover a single urban area. To bridge this data gap, we introduce the FRench ALS Clouds from TArgeted Landscapes (FRACTAL) dataset: an ultra-large-scale aerial Lidar dataset made of 100,000 dense point clouds with high quality labels for 7 semantic classes and spanning 250 km$^2$. FRACTAL achieves high spatial and semantic diversity by explicitly sampling rare classes and challenging landscapes from five different regions of France. We describe the data collection, annotation, and curation process of the dataset. We provide baseline semantic segmentation results using a state of the art 3D point cloud classification model. FRACTAL aims to support the development of 3D deep learning approaches for large-scale land monitoring.
comment: 9 (body) + 2 (bibliography) + 8 (appendices) pages | Dataset is available at https://huggingface.co/datasets/IGNF/FRACTAL | Trained model is available at https://huggingface.co/IGNF/FRACTAL-LidarHD_7cl_randlanet | Deep learning code repository is on Gihtub at https://github.com/IGNF/myria3d | Data engineering code repository is on Github at https://github.com/IGNF/pacasam
♻ ☆ Matryoshka Diffusion Models ICLR2024
Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion Models(MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images. Our code is released at https://github.com/apple/ml-mdm
comment: Accepted by ICLR2024
Multimedia
☆ LAR-IQA: A Lightweight, Accurate, and Robust No-Reference Image Quality Assessment Model
Recent advancements in the field of No-Reference Image Quality Assessment (NR-IQA) using deep learning techniques demonstrate high performance across multiple open-source datasets. However, such models are typically very large and complex making them not so suitable for real-world deployment, especially on resource- and battery-constrained mobile devices. To address this limitation, we propose a compact, lightweight NR-IQA model that achieves state-of-the-art (SOTA) performance on ECCV AIM UHD-IQA challenge validation and test datasets while being also nearly 5.7 times faster than the fastest SOTA model. Our model features a dual-branch architecture, with each branch separately trained on synthetically and authentically distorted images which enhances the model's generalizability across different distortion types. To improve robustness under diverse real-world visual conditions, we additionally incorporate multiple color spaces during the training process. We also demonstrate the higher accuracy of recently proposed Kolmogorov-Arnold Networks (KANs) for final quality regression as compared to the conventional Multi-Layer Perceptrons (MLPs). Our evaluation considering various open-source datasets highlights the practical, high-accuracy, and robust performance of our proposed lightweight model. Code: https://github.com/nasimjamshidi/LAR-IQA.
☆ Video to Music Moment Retrieval
Adding proper background music helps complete a short video to be shared. Towards automating the task, previous research focuses on video-to-music retrieval (VMR), aiming to find amidst a collection of music the one best matching the content of a given video. Since music tracks are typically much longer than short videos, meaning the returned music has to be cut to a shorter moment, there is a clear gap between the practical need and VMR. In order to bridge the gap, we propose in this paper video to music moment retrieval (VMMR) as a new task. To tackle the new task, we build a comprehensive dataset Ad-Moment which contains 50K short videos annotated with music moments and develop a two-stage approach. In particular, given a test video, the most similar music is retrieved from a given collection. Then, a Transformer based music moment localization is performed. We term this approach Retrieval and Localization (ReaL). Extensive experiments on real-world datasets verify the effectiveness of the proposed method for VMMR.
Computation and Language
☆ SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. SAM2Point interprets any 3D data as a series of multi-directional videos, and leverages SAM 2 for 3D-space segmentation, without further training or 2D-3D projection. Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, outdoor environments, and raw sparse LiDAR. Demonstrations on multiple 3D datasets, e.g., Objaverse, S3DIS, ScanNet, Semantic3D, and KITTI, highlight the robust generalization capabilities of SAM2Point. To our best knowledge, we present the most faithful implementation of SAM in 3D, which may serve as a starting point for future research in promptable 3D segmentation. Online Demo: https://huggingface.co/spaces/ZiyuG/SAM2Point . Code: https://github.com/ZiyuGuo99/SAM2Point .
comment: Work in progress. Online Demo: https://huggingface.co/spaces/ZiyuG/SAM2Point . Code: https://github.com/ZiyuGuo99/SAM2Point
☆ How Far Can Cantonese NLP Go? Benchmarking Cantonese Capabilities of Large Language Models
The rapid evolution of large language models (LLMs) has transformed the competitive landscape in natural language processing (NLP), particularly for English and other data-rich languages. However, underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps, which is particularly concerning given the economic significance of the Guangdong-Hong Kong-Macau Greater Bay Area, and in substantial Cantonese-speaking populations in places like Singapore and North America. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. To bridge these gaps, we outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese, which aim to advance open-source Cantonese LLM technology. We also propose future research directions and recommended models to enhance Cantonese LLM development.
☆ Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models
Reinforcement learning is used to align language models with human preference signals after first pre-training the model to predict the next token of text within a large corpus using likelihood maximization. Before being deployed in a specific domain, models are often further fine-tuned on task specific data. Since human preferences are often unavailable for the last step, it is performed using likelihood maximization as that is the typical default method. However, reinforcement learning has other advantages besides facilitating alignment to a human derived reward function. For one, whereas likelihood maximization is a form of imitation learning in which the model is trained on what to do under ideal conditions, reinforcement learning is not limited to demonstrating actions just for optimally reached states and trains a model what to do under a range of scenarios as it explores the policy space. In addition, it also trains a model what not to do, suppressing competitive but poor actions. This work develops a framework for last-mile fine-tuning using reinforcement learning and tests whether it garners performance gains. The experiments center on abstractive summarization, but the framework is general and broadly applicable. Use of the procedure produced significantly better results than likelihood maximization when comparing raw predictions. For the specific data tested, the gap could be bridged by employing post-processing of the maximum likelihood outputs. Nonetheless, the framework offers a new avenue for model optimization in situations where post-processing may be less straightforward or effective, and it can be extended to include more complex classes of undesirable outputs to penalize and train against, such as hallucinations.
☆ A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models
Beyond maximum likelihood estimation (MLE), the standard objective of a language model (LM) that optimizes good examples probabilities, many studies have explored ways that also penalize bad examples for enhancing the quality of output distribution, including unlikelihood training, exponential maximizing average treatment effect (ExMATE), and direct preference optimization (DPO). To systematically compare these methods and further provide a unified recipe for LM optimization, in this paper, we present a unique angle of gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs. Through both mathematical results and experiments on CausalDialogue and Anthropic HH-RLHF datasets, we identify distinct functional characteristics among these methods. We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both the statistical (5-7%) and generative (+18% win rate) performance.
☆ Assessing Large Language Models for Online Extremism Research: Identification, Explanation, and New Knowledge
The United States has experienced a significant increase in violent extremism, prompting the need for automated tools to detect and limit the spread of extremist ideology online. This study evaluates the performance of Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-Trained Transformers (GPT) in detecting and classifying online domestic extremist posts. We collected social media posts containing "far-right" and "far-left" ideological keywords and manually labeled them as extremist or non-extremist. Extremist posts were further classified into one or more of five contributing elements of extremism based on a working definitional framework. The BERT model's performance was evaluated based on training data size and knowledge transfer between categories. We also compared the performance of GPT 3.5 and GPT 4 models using different prompts: na\"ive, layperson-definition, role-playing, and professional-definition. Results showed that the best performing GPT models outperformed the best performing BERT models, with more detailed prompts generally yielding better results. However, overly complex prompts may impair performance. Different versions of GPT have unique sensitives to what they consider extremist. GPT 3.5 performed better at classifying far-left extremist posts, while GPT 4 performed better at classifying far-right extremist posts. Large language models, represented by GPT models, hold significant potential for online extremism classification tasks, surpassing traditional BERT models in a zero-shot setting. Future research should explore human-computer interactions in optimizing GPT models for extremist detection and classification tasks to develop more efficient (e.g., quicker, less effort) and effective (e.g., fewer errors or mistakes) methods for identifying extremist content.
☆ Theoretical and Methodological Framework for Studying Texts Produced by Large Language Models
This paper addresses the conceptual, methodological and technical challenges in studying large language models (LLMs) and the texts they produce from a quantitative linguistics perspective. It builds on a theoretical framework that distinguishes between the LLM as a substrate and the entities the model simulates. The paper advocates for a strictly non-anthropomorphic approach to models while cautiously applying methodologies used in studying human linguistic behavior to the simulated entities. While natural language processing researchers focus on the models themselves, their architecture, evaluation, and methods for improving performance, we as quantitative linguists should strive to build a robust theory concerning the characteristics of texts produced by LLMs, how they differ from human-produced texts, and the properties of simulated entities. Additionally, we should explore the potential of LLMs as an instrument for studying human culture, of which language is an integral part.
☆ Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
Training on high-quality synthetic data from strong language models (LMs) is a common strategy to improve the reasoning performance of LMs. In this work, we revisit whether this strategy is compute-optimal under a fixed inference budget (e.g., FLOPs). To do so, we investigate the trade-offs between generating synthetic data using a stronger but more expensive (SE) model versus a weaker but cheaper (WC) model. We evaluate the generated data across three key metrics: coverage, diversity, and false positive rate, and show that the data from WC models may have higher coverage and diversity, but also exhibit higher false positive rates. We then finetune LMs on data from SE and WC models in different settings: knowledge distillation, self-improvement, and a novel weak-to-strong improvement setup where a weaker LM teaches reasoning to a stronger LM. Our findings reveal that models finetuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and multiple choices of WC and SE models. These results challenge the prevailing practice of relying on SE models for synthetic data generation, suggesting that WC may be the compute-optimal approach for training advanced LM reasoners.
☆ Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model's language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method "Any Model Can Talk". We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.
comment: 10 pages
☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.
☆ Iterative Graph Alignment
By compressing diverse narratives, LLMs go beyond memorization, achieving intelligence by capturing generalizable causal relationships. However, they suffer from local 'representation gaps' due to insufficient training data diversity, limiting their real-world utility, especially in tasks requiring strict alignment to rules. Traditional alignment methods relying on heavy human annotations are inefficient and unscalable. Recent self-alignment techniques also fall short, as they often depend on self-selection based prompting and memorization-based learning. To address these issues, we introduce Iterative Graph Alignment (IGA), an annotation-free rule-based alignment algorithm. A teacher model (VLM) employs Iterative Graph Prompting (IGP) to create logical graphs and reference answers. The student model (LLM) identifies local knowledge gaps by attempting to align its responses with these references, collaborating with helper models to generate diverse answers. These aligned responses are then used for iterative supervised fine-tuning (SFT). Our evaluations across five rule-based scenarios demonstrate IGP's effectiveness, with a 73.12\% alignment improvement in Claude Sonnet 3.5, and Llama3-8B-Instruct achieving an 86.20\% improvement, outperforming Claude Sonnet 3.5 in rule-based alignment.
comment: 12 pages, 4 figures
☆ Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies
Recent advancements in natural language processing, particularly with large language models (LLMs) like GPT-4, have significantly enhanced dialogue systems, enabling them to generate more natural and fluent conversations. Despite these improvements, challenges persist, such as managing continuous dialogues, memory retention, and minimizing hallucinations. The AIWolfDial2024 addresses these challenges by employing the Werewolf Game, an incomplete information game, to test the capabilities of LLMs in complex interactive environments. This paper introduces a LLM-based Werewolf Game AI, where each role is supported by situation analysis to aid response generation. Additionally, for the werewolf role, various persuasion strategies, including logical appeal, credibility appeal, and emotional appeal, are employed to effectively persuade other players to align with its actions.
comment: Accepted to the AIWolfDial2024 workshop at INLG 2024
☆ Predictability maximization and the origins of word order harmony
We address the linguistic problem of the sequential arrangement of a head and its dependents from an information theoretic perspective. In particular, we consider the optimal placement of a head that maximizes the predictability of the sequence. We assume that dependents are statistically independent given a head, in line with the open-choice principle and the core assumptions of dependency grammar. We demonstrate the optimality of harmonic order, i.e., placing the head last maximizes the predictability of the head whereas placing the head first maximizes the predictability of dependents. We also show that postponing the head is the optimal strategy to maximize its predictability while bringing it forward is the optimal strategy to maximize the predictability of dependents. We unravel the advantages of the strategy of maximizing the predictability of the head over maximizing the predictability of dependents. Our findings shed light on the placements of the head adopted by real languages or emerging in different kinds of experiments.
☆ SALSA: Speedy ASR-LLM Synchronous Aggregation INTERSPEECH 2024
Harnessing pre-trained LLMs to improve ASR systems, particularly for low-resource languages, is now an emerging area of research. Existing methods range from using LLMs for ASR error correction to tightly coupled systems that replace the ASR decoder with the LLM. These approaches either increase decoding time or require expensive training of the cross-attention layers. We propose SALSA, which couples the decoder layers of the ASR to the LLM decoder, while synchronously advancing both decoders. Such coupling is performed with a simple projection of the last decoder state, and is thus significantly more training efficient than earlier approaches. A challenge of our proposed coupling is handling the mismatch between the tokenizers of the LLM and ASR systems. We handle this mismatch using cascading tokenization with respect to the LLM and ASR vocabularies. We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.
comment: Accepted to INTERSPEECH 2024
☆ CNIMA: A Universal Evaluation Framework and Automated Approach for Assessing Second Language Dialogues
We develop CNIMA (Chinese Non-Native Interactivity Measurement and Automation), a Chinese-as-a-second-language labelled dataset with 10K dialogues. We annotate CNIMA using an evaluation framework -- originally introduced for English-as-a-second-language dialogues -- that assesses micro-level features (e.g.\ backchannels) and macro-level interactivity labels (e.g.\ topic management) and test the framework's transferability from English to Chinese. We found the framework robust across languages and revealed universal and language-specific relationships between micro-level and macro-level features. Next, we propose an approach to automate the evaluation and find strong performance, creating a new tool for automated second language assessment. Our system can be adapted to other languages easily as it uses large language models and as such does not require large-scale annotated training data.
☆ LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?
The generative large language models (LLMs) are increasingly being used for data augmentation tasks, where text samples are LLM-paraphrased and then used for classifier fine-tuning. However, a research that would confirm a clear cost-benefit advantage of LLMs over more established augmentation methods is largely missing. To study if (and when) is the LLM-based augmentation advantageous, we compared the effects of recent LLM augmentation methods with established ones on 6 datasets, 3 classifiers and 2 fine-tuning methods. We also varied the number of seeds and collected samples to better explore the downstream model accuracy space. Finally, we performed a cost-benefit analysis and show that LLM-based methods are worthy of deployment only when very small number of seeds is used. Moreover, in many cases, established methods lead to similar or better model accuracies.
comment: 20 pages
☆ Learning from Negative Samples in Generative Biomedical Entity Linking
Generative models have become widely used in biomedical entity linking (BioEL) due to their excellent performance and efficient memory usage. However, these models are usually trained only with positive samples--entities that match the input mention's identifier--and do not explicitly learn from hard negative samples, which are entities that look similar but have different meanings. To address this limitation, we introduce ANGEL (Learning from Negative Samples in Generative Biomedical Entity Linking), the first framework that trains generative BioEL models using negative samples. Specifically, a generative model is initially trained to generate positive samples from the knowledge base for given input entities. Subsequently, both correct and incorrect outputs are gathered from the model's top-k predictions. The model is then updated to prioritize the correct predictions through direct preference optimization. Our models fine-tuned with ANGEL outperform the previous best baseline models by up to an average top-1 accuracy of 1.4% on five benchmarks. When incorporating our framework into pre-training, the performance improvement further increases to 1.7%, demonstrating its effectiveness in both the pre-training and fine-tuning stages. Our code is available at https://github.com/dmis-lab/ANGEL.
☆ Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning
Improving the alignment of Large Language Models (LLMs) with respect to the cultural values that they encode has become an increasingly important topic. In this work, we study whether we can exploit existing knowledge about cultural values at inference time to adjust model responses to cultural value probes. We present a simple and inexpensive method that uses a combination of in-context learning (ICL) and human survey data, and show that we can improve the alignment to cultural values across 5 models that include both English-centric and multilingual LLMs. Importantly, we show that our method could prove useful in test languages other than English and can improve alignment to the cultural values that correspond to a range of culturally diverse countries.
☆ Is text normalization relevant for classifying medieval charters?
This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections
SurveySum: A Dataset for Summarizing Multiple Scientific Articles into a Survey Section
Document summarization is a task to shorten texts into concise and informative summaries. This paper introduces a novel dataset designed for summarizing multiple scientific articles into a section of a survey. Our contributions are: (1) SurveySum, a new dataset addressing the gap in domain-specific summarization tools; (2) two specific pipelines to summarize scientific articles into a section of a survey; and (3) the evaluation of these pipelines using multiple metrics to compare their performance. Our results highlight the importance of high-quality retrieval stages and the impact of different configurations on the quality of generated summaries.
comment: 15 pages, 6 figures, 1 table. Submitted to BRACIS 2024
☆ Instruction-tuned Large Language Models for Machine Translation in the Medical Domain
Large Language Models (LLMs) have shown promising results on machine translation for high resource language pairs and domains. However, in specialised domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models. The consistency in the machine translation of terminology is crucial for users, researchers, and translators in specialised domains. In this study, we compare the performance between baseline LLMs and instruction-tuned LLMs in the medical domain. In addition, we introduce terminology from specialised medical dictionaries into the instruction formatted datasets for fine-tuning LLMs. The instruction-tuned LLMs significantly outperform the baseline models with automatic metrics.
☆ MQM-Chat: Multidimensional Quality Metrics for Chat Translation
The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). Through the experiments of five models using MQM-Chat, we observed that all models generated certain fundamental errors, while each of them has different shortcomings, such as omission, overly correcting ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation, emphasizing the importance of stylized content and dialogue consistency for future studies.
☆ The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization
This work analyses the text memorization behavior of large language models (LLMs) when subjected to nucleus sampling. Stochastic decoding methods like nucleus sampling are typically applied to overcome issues such as monotonous and repetitive text generation, which are often observed with maximization-based decoding techniques. We hypothesize that nucleus sampling might also reduce the occurrence of memorization patterns, because it could lead to the selection of tokens outside the memorized sequence. To test this hypothesis we create a diagnostic dataset with a known distribution of duplicates that gives us some control over the likelihood of memorization of certain parts of the training data. Our analysis of two GPT-Neo models fine-tuned on this dataset interestingly shows that (i) an increase of the nucleus size reduces memorization only modestly, and (ii) even when models do not engage in "hard" memorization -- a verbatim reproduction of training samples -- they may still display "soft" memorization whereby they generate outputs that echo the training data but without a complete one-by-one resemblance.
comment: 9 pages, Accepted at INLG 2024 (International Natural Language Generation Conference)
☆ Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic
Self-critic has become an important mechanism for enhancing the reasoning performance of LLMs. However, current approaches mainly involve basic prompts without further training, which tend to be over-simplified, leading to limited accuracy.Moreover, there is a lack of in-depth investigation of the relationship between LLM's ability to criticism and its task-solving performance.To address these issues, we propose Critic-CoT, a novel framework that pushes LLMs toward System-2-like critic capability, via step-wise CoT reasoning format and distant-supervision data construction, without the need for human annotation. Experiments on GSM8K and MATH show that via filtering out invalid solutions or iterative refinement, our enhanced model boosts task-solving performance, which demonstrates the effectiveness of our method. Further, we find that training on critique and refinement alone improves the generation. We hope our work could shed light on future research on improving the reasoning and critic ability of LLMs.
☆ Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to "self-correct" their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others.
comment: arXiv admin note: text overlap with arXiv:2407.20311
☆ Measuring the Accuracy of Automatic Speech Recognition Solutions
For d/Deaf and hard of hearing (DHH) people, captioning is an essential accessibility tool. Significant developments in artificial intelligence (AI) mean that Automatic Speech Recognition (ASR) is now a part of many popular applications. This makes creating captions easy and broadly available - but transcription needs high levels of accuracy to be accessible. Scientific publications and industry report very low error rates, claiming AI has reached human parity or even outperforms manual transcription. At the same time the DHH community reports serious issues with the accuracy and reliability of ASR. There seems to be a mismatch between technical innovations and the real-life experience for people who depend on transcription. Independent and comprehensive data is needed to capture the state of ASR. We measured the performance of eleven common ASR services with recordings of Higher Education lectures. We evaluated the influence of technical conditions like streaming, the use of vocabularies, and differences between languages. Our results show that accuracy ranges widely between vendors and for the individual audio samples. We also measured a significant lower quality for streaming ASR, which is used for live events. Our study shows that despite the recent improvements of ASR, common services lack reliability in accuracy.
☆ Enhancing AI-Driven Psychological Consultation: Layered Prompts with Large Language Models
Psychological consultation is essential for improving mental health and well-being, yet challenges such as the shortage of qualified professionals and scalability issues limit its accessibility. To address these challenges, we explore the use of large language models (LLMs) like GPT-4 to augment psychological consultation services. Our approach introduces a novel layered prompting system that dynamically adapts to user input, enabling comprehensive and relevant information gathering. We also develop empathy-driven and scenario-based prompts to enhance the LLM's emotional intelligence and contextual understanding in therapeutic settings. We validated our approach through experiments using a newly collected dataset of psychological consultation dialogues, demonstrating significant improvements in response quality. The results highlight the potential of our prompt engineering techniques to enhance AI-driven psychological consultation, offering a scalable and accessible solution to meet the growing demand for mental health support.
☆ LoraMap: Harnessing the Power of LoRA Connections
Large Language Models (LLMs) can benefit from mitigating hallucinations through fact-checking and overcoming substantial computational overhead with parameter-efficient techniques such as Low-Rank Adaptation (LoRA). While some studies have explored the parallel integration of multiple LoRAs, these approaches need attention to the connections between them. This paper investigates methods to establish connections among multiple LoRAs. We create three reasoning datasets tailored to fact-checking and fine-tune individual LoRAs, allowing them to view and reason from diverse perspectives. Then, we explore strategies for allocating these reasoning LoRAs and introduce LoraMap, an approach to map connections between them. The results on the fact-checking task demonstrate that the performance of LoraMap is superior to LoraHub, an existing LoRA composition method. LoraMap also outperforms with significantly fewer parameters than LoraConcat, which concatenates LoRAs and further fine-tunes them.
comment: 13 pages, 9 figures, 5 tables
☆ Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers
This thesis provides methods and analysis of models which make progress on this goal. The techniques outlined are task agnostic, and should provide benefit when used with nearly any transformer LM. We introduce two new finetuning methods which add new capabilities to the models they are used on. The first adds a recurrence mechanism, which removes the fixed-window sized constraint and improves the efficiency of a transformer decoder. The second allows masked language models (MLMs) to be used for initialization of both the encoder and decoder of a non-autoregressive sequence-to-sequence transformer, opening up generative applications of models which were previously only used for natural language understanding tasks. We also introduce two new techniques for improving the quality of predictions of any transformer decoder without additional finetuning. One, hidden state optimization, can be applied to any transformer decoder to improve the quality of predictions at inference time, especially for few-shot classification. The other, conditional beam search, allows practitioners to search for natural language generation (NLG) model outputs with high likelihood while conditioning on the event that the output is not degenerate (e.g. empty, repetitive, etc.). Finally, we provide theoretical and empirical insights on the divergence of model-likelihood and output quality which has widely been observed in prior work. These insights apply to any model which represents a distribution over text, and apply to language models which are not transformers or even autoregressive. We argue that the NLP community has, to some extent, misunderstood the implications of these findings, and encourage a point of view which has more nuance.
comment: PhD thesis
☆ SSDM: Scalable Speech Dysfluency Modeling
Speech dysfluency modeling is the core module for spoken language learning, and speech therapy. However, there are three challenges. First, current state-of-the-art solutions suffer from poor scalability. Second, there is a lack of a large-scale dysfluency corpus. Third, there is not an effective learning framework. In this paper, we propose \textit{SSDM: Scalable Speech Dysfluency Modeling}, which (1) adopts articulatory gestures as scalable forced alignment; (2) introduces connectionist subsequence aligner (CSA) to achieve dysfluency alignment; (3) introduces a large-scale simulated dysfluency corpus called Libri-Dys; and (4) develops an end-to-end system by leveraging the power of large language models (LLMs). We expect SSDM to serve as a standard in the area of dysfluency modeling. Demo is available at \url{https://eureka235.github.io}.
☆ M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation
The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the multi-tasking capabilities of LLMs or lacking clinical accuracy. This paper presents M4CXR, a multi-modal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multi-image, and multi-study contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and also demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR's versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.
☆ From cart to truck: meaning shift through words in English in the last two centuries
This onomasiological study uses diachronic word embeddings to explore how different words represented the same concepts over time, using historical word data from 1800 to 2000. We identify shifts in energy, transport, entertainment, and computing domains, revealing connections between language and societal changes. Our approach consisted in using diachronic word embeddings trained using word2vec with skipgram and aligning them using orthogonal Procrustes. We discuss possible difficulties linked to the relationships the method identifies. Moreover, we look at the ethical aspects of interpreting results, highlighting the need for expert insights to understand the method's significance.
comment: 7 pages, 1 figure
☆ ReXamine-Global: A Framework for Uncovering Inconsistencies in Radiology Report Generation Metrics
Given the rapidly expanding capabilities of generative AI models for radiology, there is a need for robust metrics that can accurately measure the quality of AI-generated radiology reports across diverse hospitals. We develop ReXamine-Global, a LLM-powered, multi-site framework that tests metrics across different writing styles and patient populations, exposing gaps in their generalization. First, our method tests whether a metric is undesirably sensitive to reporting style, providing different scores depending on whether AI-generated reports are stylistically similar to ground-truth reports or not. Second, our method measures whether a metric reliably agrees with experts, or whether metric and expert scores of AI-generated report quality diverge for some sites. Using 240 reports from 6 hospitals around the world, we apply ReXamine-Global to 7 established report evaluation metrics and uncover serious gaps in their generalizability. Developers can apply ReXamine-Global when designing new report evaluation metrics, ensuring their robustness across sites. Additionally, our analysis of existing metrics can guide users of those metrics towards evaluation procedures that work reliably at their sites of interest.
☆ Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction
With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements to address ASR errors. This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text utterances. We also introduce a new multi-pass augmented generative error correction (MPA GER) by integrating multiple system hypotheses on the input side with corrections from multiple LLMs on the output side and then merging them. To the best of our knowledge, this is the first investigation of the use of LLMs for Japanese GER, which involves second-pass language modeling on the output transcriptions generated by the ASR system (e.g., N-best hypotheses). Our experiments demonstrated performance improvement in the proposed methods of ASR quality and generalization both in SPREDS-U1-ja and CSJ data.
comment: submitted to SLT2024
☆ A longitudinal sentiment analysis of Sinophobia during COVID-19 using large language models
The COVID-19 pandemic has exacerbated xenophobia, particularly Sinophobia, leading to widespread discrimination against individuals of Chinese descent. Large language models (LLMs) are pre-trained deep learning models used for natural language processing (NLP) tasks. The ability of LLMs to understand and generate human-like text makes them particularly useful for analysing social media data to detect and evaluate sentiments. We present a sentiment analysis framework utilising LLMs for longitudinal sentiment analysis of the Sinophobic sentiments expressed in X (Twitter) during the COVID-19 pandemic. The results show a significant correlation between the spikes in Sinophobic tweets, Sinophobic sentiments and surges in COVID-19 cases, revealing that the evolution of the pandemic influenced public sentiment and the prevalence of Sinophobic discourse. Furthermore, the sentiment analysis revealed a predominant presence of negative sentiments, such as annoyance and denial, which underscores the impact of political narratives and misinformation shaping public opinion. The lack of empathetic sentiment which was present in previous studies related to COVID-19 highlights the way the political narratives in media viewed the pandemic and how it blamed the Chinese community. Our study highlights the importance of transparent communication in mitigating xenophobic sentiments during global crises.
☆ Plausible-Parrots @ MSP2023: Enhancing Semantic Plausibility Modeling using Entity and Event Knowledge
In this work, we investigate the effectiveness of injecting external knowledge to a large language model (LLM) to identify semantic plausibility of simple events. Specifically, we enhance the LLM with fine-grained entity types, event types and their definitions extracted from an external knowledge base. These knowledge are injected into our system via designed templates. We also augment the data to balance the label distribution and adapt the task setting to real world scenarios in which event mentions are expressed as natural language sentences. The experimental results show the effectiveness of the injected knowledge on modeling semantic plausibility of events. An error analysis further emphasizes the importance of identifying non-trivial entity and event types.
comment: 10 pages, 5 figures, 5 tables
☆ Event Extraction for Portuguese: A QA-driven Approach using ACE-2005
Event extraction is an Information Retrieval task that commonly consists of identifying the central word for the event (trigger) and the event's arguments. This task has been extensively studied for English but lags behind for Portuguese, partly due to the lack of task-specific annotated corpora. This paper proposes a framework in which two separated BERT-based models were fine-tuned to identify and classify events in Portuguese documents. We decompose this task into two sub-tasks. Firstly, we use a token classification model to detect event triggers. To extract event arguments, we train a Question Answering model that queries the triggers about their corresponding event argument roles. Given the lack of event annotated corpora in Portuguese, we translated the original version of the ACE-2005 dataset (a reference in the field) into Portuguese, producing a new corpus for Portuguese event extraction. To accomplish this, we developed an automatic translation pipeline. Our framework obtains F1 marks of 64.4 for trigger classification and 46.7 for argument classification setting, thus a new state-of-the-art reference for these tasks in Portuguese.
☆ ACE-2005-PT: Corpus for Event Extraction in Portuguese
Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To achieve this, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by a linguist expert. This subset was then compared against our pipeline results which achieved exact and relaxed match scores of 70.55\% and 87.55\% respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.
☆ Exploring Multiple Strategies to Improve Multilingual Coreference Resolution in CorefUD
Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component in various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17 datasets across 12 languages. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance performance across diverse linguistic contexts. These extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model for optimized headword prediction, and advanced singleton modeling. We also experiment with headword span representation and long-documents modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long document prediction significantly improve performance across most datasets. We also perform zero-shot cross-lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference resolution. Our findings contribute to the development of robust and scalable coreference systems for multilingual coreference resolution. Finally, we evaluate our model on CorefUD 1.1 test set and surpass the best model from CRAC 2023 shared task of a comparable size by a large margin. Our nodel is available on GitHub: \url{https://github.com/ondfa/coref-multiling}
☆ LLaVA-Chef: A Multi-modal Generative Model for Food Recipes
In the rapidly evolving landscape of online recipe sharing within a globalized context, there has been a notable surge in research towards comprehending and generating food recipes. Recent advancements in large language models (LLMs) like GPT-2 and LLaVA have paved the way for Natural Language Processing (NLP) approaches to delve deeper into various facets of food-related tasks, encompassing ingredient recognition and comprehensive recipe generation. Despite impressive performance and multi-modal adaptability of LLMs, domain-specific training remains paramount for their effective application. This work evaluates existing LLMs for recipe generation and proposes LLaVA-Chef, a novel model trained on a curated dataset of diverse recipe prompts in a multi-stage approach. First, we refine the mapping of visual food image embeddings to the language space. Second, we adapt LLaVA to the food domain by fine-tuning it on relevant recipe data. Third, we utilize diverse prompts to enhance the model's recipe comprehension. Finally, we improve the linguistic quality of generated recipes by penalizing the model with a custom loss function. LLaVA-Chef demonstrates impressive improvements over pretrained LLMs and prior works. A detailed qualitative analysis reveals that LLaVA-Chef generates more detailed recipes with precise ingredient mentions, compared to existing approaches.
☆ Modeling offensive content detection for TikTok
The advent of social media transformed interpersonal communication and information consumption processes. This digital landscape accommodates user intentions, also resulting in an increase of offensive language and harmful behavior. Concurrently, social media platforms collect vast datasets comprising user-generated content and behavioral information. These datasets are instrumental for platforms deploying machine learning and data-driven strategies, facilitating customer insights and countermeasures against social manipulation mechanisms like disinformation and offensive content. Nevertheless, the availability of such datasets, along with the application of various machine learning techniques, to researchers and practitioners, for specific social media platforms regarding particular events, is limited. In particular for TikTok, which offers unique tools for personalized content creation and sharing, the existing body of knowledge would benefit from having diverse comprehensive datasets and associated data analytics solutions on offensive content. While efforts from social media platforms, research, and practitioner communities are seen on this behalf, such content continues to proliferate. This translates to an essential need to make datasets publicly available and build corresponding intelligent solutions. On this behalf, this research undertakes the collection and analysis of TikTok data containing offensive content, building a series of machine learning and deep learning models for offensive content detection. This is done aiming at answering the following research question: "How to develop a series of computational models to detect offensive content on TikTok?". To this end, a Data Science methodological approach is considered, 120.423 TikTok comments are collected, and on a balanced, binary classification approach, F1 score performance results of 0.863 is obtained.
comment: Accepted as a conference paper at DPSH 2024, 8 pages
☆ See or Guess: Counterfactually Regularized Image Captioning ACM MM 2024
Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe those where certain parts of the image are obscured or edited, unlike humans who excel in such cases. These weaknesses they exhibit, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at https://github.com/Aman-4-Real/See-or-Guess.
comment: Accepted by ACM MM 2024
♻ ☆ Awes, Laws, and Flaws From Today's LLM Research
We perform a critical examination of the scientific methodology behind contemporary large language model (LLM) research. For this we assess over 2,000 research works based on criteria typical of what is considered good research (e.g. presence of statistical tests and reproducibility) and cross-validate it with arguments that are at the centre of controversy (e.g., claims of emergent behaviour, the use of LLMs as evaluators). We find multiple trends, such as declines in claims of emergent behaviour and ethics disclaimers; the rise of LLMs as evaluators in spite of a lack of consensus from the community about their useability; and an increase of claims of LLM reasoning abilities, typically without leveraging human evaluation. This paper underscores the need for more scrutiny and rigour by and from this field to live up to the fundamentals of a responsible scientific method that is ethical, reproducible, systematic, and open to criticism.
comment: Under review -- v1 was an old draft with an unrevised abstract (oops)
♻ ☆ CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl SP
The Common Crawl (CC) corpus is the largest open web crawl dataset containing 9.5+ petabytes of data captured since 2008. The dataset is instrumental in training large language models, and as such it has been studied for (un)desirable content, and distilled for smaller, domain-specific datasets. However, to our knowledge, no research has been dedicated to using CC as a source of annotated geospatial data. In this paper, we introduce an efficient pipeline to extract annotated user-generated tracks from GPX files found in CC, and the resulting multimodal dataset with 1,416 pairings of human-written descriptions and MultiLineString vector data from the 6 most recent CC releases. The dataset can be used to study people's outdoor activity patterns, the way people talk about their outdoor experiences, as well as for developing trajectory generation or track annotation models, or for various other problems in place of synthetically generated routes. Our reproducible code is available on GitHub: https://github.com/ilyankou/cc-gpx
comment: Accepted as a poster to ACM SIGSPATIAL 2024
♻ ☆ Quantifying Geospatial in the Common Crawl Corpus SP
Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses. We find little difference in prevalence between Enlgish- and non-English-language documents. Our findings provide quantitative insights into the nature and extent of geospatial data in CC, and lay the groundwork for future studies of geospatial biases of LLMs.
comment: Accepted as a poster to ACM SIGSPATIAL 2024
♻ ☆ GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Key-value (KV) caching has become the de-facto to accelerate generation speed for large language models (LLMs) inference. However, the growing cache demand with increasing sequence length has transformed LLM inference to be a memory bound problem, significantly constraining the system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors to represent the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first applies quantization to majority of entries of similar magnitudes to ultra-low precision. It then employs a low rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating three techniques, GEAR is able to fully exploit their synergistic potentials. Our experiments demonstrate that compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak-memory size up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
♻ ☆ Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
Visual storytelling consists in generating a natural language story given a temporally ordered sequence of images. This task is not only challenging for models, but also very difficult to evaluate with automatic metrics since there is no consensus about what makes a story 'good'. In this paper, we introduce a novel method that measures story quality in terms of human likeness regarding three key aspects highlighted in previous work: visual grounding, coherence, and repetitiveness. We then use this method to evaluate the stories generated by several models, showing that the foundation model LLaVA obtains the best result, but only slightly so compared to TAPM, a 50-times smaller visual storytelling model. Upgrading the visual and language components of TAPM results in a model that yields competitive performance with a relatively low number of parameters. Finally, we carry out a human evaluation study, whose results suggest that a 'good' story may require more than a human-like level of visual grounding, coherence, and repetition.
♻ ☆ Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express CIKM 2024
As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of AB tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70\%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.
comment: CIKM 2024 (International Conference on Information and Knowledge Management), Multimodal Search and Recommendations Workshop
♻ ☆ Mitigating Exaggerated Safety in Large Language Models
As the popularity of Large Language Models (LLMs) grow, combining model safety with utility becomes increasingly important. The challenge is making sure that LLMs can recognize and decline dangerous prompts without sacrificing their ability to be helpful. The problem of "exaggerated safety" demonstrates how difficult this can be. To reduce excessive safety behaviours -- which was discovered to be 26.1% of safe prompts being misclassified as dangerous and refused -- we use a combination of XSTest dataset prompts as well as interactive, contextual, and few-shot prompting to examine the decision bounds of LLMs such as Llama2, Gemma Command R+, and Phi-3. We find that few-shot prompting works best for Llama2, interactive prompting works best Gemma, and contextual prompting works best for Command R+ and Phi-3. Using a combination of these prompting strategies, we are able to mitigate exaggerated safety behaviors by an overall 92.9% across all LLMs. Our work presents a multiple prompting strategies to jailbreak LLMs' decision-making processes, allowing them to navigate the tight line between refusing unsafe prompts and remaining helpful.
comment: 17 pages, 8 figures, 2 tables
♻ ☆ Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction
Existing research on large language models (LLMs) shows that they can solve information extraction tasks through multi-step planning. However, their extraction behavior on complex sentences and tasks is unstable, emerging issues such as false positives and missing elements. We observe that decomposing complex extraction tasks and extracting them step by step can effectively improve LLMs' performance, and the extraction orders of entities significantly affect the final results of LLMs. This paper proposes a two-stage multi-step method for LLM-based information extraction and adopts the RL framework to execute the multi-step planning. We regard sequential extraction as a Markov decision process, build an LLM-based extraction environment, design a decision module to adaptively provide the optimal order for sequential entity extraction on different sentences, and utilize the DDQN algorithm to train the decision model. We also design the rewards and evaluation metrics suitable for the extraction results of LLMs. We conduct extensive experiments on multiple public datasets to demonstrate the effectiveness of our method in improving the information extraction capabilities of LLMs.
♻ ☆ Conan-embedding: General Text Embedding with More and Better Negative Samples
With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained through contrastive loss learning, with negative examples being a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. Specifically, since the model's ability to handle preprocessed negative examples evolves during training, we propose dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Secondly, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU balancing Loss to provide more negative examples for embedding training and balance the batch size across multiple tasks. Moreover, we also discovered that the prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of Massive text embedding benchmark
♻ ☆ Innovative Speech-Based Deep Learning Approaches for Parkinson's Disease Classification: A Systematic Review
Parkinson's disease (PD), the second most prevalent neurodegenerative disorder worldwide, frequently presents with early-stage speech impairments. Recent advancements in Artificial Intelligence (AI), particularly deep learning (DL), have significantly enhanced PD diagnosis through the analysis of speech data. Nevertheless, the progress of research is restricted by the limited availability of publicly accessible speech-based PD datasets, primarily due to privacy concerns. The goal of this systematic review is to explore the current landscape of speech-based DL approaches for PD classification, based on 33 scientific works published between 2020 and March 2024. We discuss their available resources, capabilities, potential limitations, and issues related to bias, explainability, and privacy. Furthermore, this review provides an overview of publicly accessible speech-based datasets and open-source material for PD. The DL approaches are categorized into end-to-end (E2E) learning, transfer learning (TL) and deep acoustic features extraction (DAFE) approaches. Among E2E approaches, Convolutional Neural Networks (CNNs) are prevalent, though Transformers are increasingly popular. E2E approaches face challenges such as limited data and computational resources, especially with Transformers. TL addresses these issues by providing more robust PD diagnosis and better generalizability across languages. DAFE aims to improve the explainability and interpretability of results by examining the specific effects of deep features on both other DL approaches and more traditional machine learning (ML) methods. However, it often underperforms compared to E2E and TL approaches.
comment: Submitted in Applied Sciences - peer reviewed Open Access journal. This research was funded by the NWO research programme AiNed Fellowship Grants under the project Responsible AI for Voice Diagnostics (RAIVD) - grant number NGF.1607.22.013
♻ ☆ Can LLMs perform structured graph reasoning? ICPR
Pretrained Large Language Models (LLMs) have demonstrated various reasoning capabilities through language-based prompts alone, particularly in unstructured task settings (tasks purely based on language semantics). However, LLMs often struggle with structured tasks, because of the inherent incompatibility of input representation. Reducing structured tasks to uni-dimensional language semantics often renders the problem trivial. Keeping the trade-off between LLM compatibility and structure complexity in mind, we design various graph reasoning tasks as a proxy to semi-structured tasks in this paper, in order to test the ability to navigate through representations beyond plain text in various LLMs. Particularly, we design 10 distinct problems of graph traversal, each representing increasing levels of complexity, and benchmark 5 different instruct-finetuned LLMs (GPT-4, GPT-3.5, Claude-2, Llama-2 and Palm-2) on the aforementioned tasks. Further, we analyse the performance of models across various settings such as varying sizes of graphs as well as different forms of k-shot prompting. We highlight various limitations, biases and properties of LLMs through this benchmarking process, such as an inverse relation to the average degrees of freedom of traversal per node in graphs, the overall negative impact of k-shot prompting on graph reasoning tasks, and a positive response bias which prevents LLMs from identifying the absence of a valid solution. Finally, we introduce a new prompting technique specially designed for graph traversal tasks (PathCompare), which demonstrates a notable increase in the performance of LLMs in comparison to standard prompting techniques such as Chain-of-Thought (CoT).
comment: International Conference on Pattern Recognition (ICPR), 2024
♻ ☆ The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning
Understanding commonsense causality is a unique mark of intelligence for humans. It helps people understand the principles of the real world better and benefits the decision-making process related to causation. For instance, commonsense causality is crucial in judging whether a defendant's action causes the plaintiff's loss in determining legal liability. Despite its significance, a systematic exploration of this topic is notably lacking. Our comprehensive survey bridges this gap by focusing on taxonomies, benchmarks, acquisition methods, qualitative reasoning, and quantitative measurements in commonsense causality, synthesizing insights from over 200 representative articles. Our work aims to provide a systematic overview, update scholars on recent advancements, provide a pragmatic guide for beginners, and highlight promising future research directions in this vital field.
comment: 42 pages
♻ ☆ Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
Reinforcement Learning from Human Feedback (RLHF) has proven effective in aligning large language models with human intentions, yet it often relies on complex methodologies like Proximal Policy Optimization (PPO) that require extensive hyper-parameter tuning and present challenges in sample efficiency and stability. In this paper, we introduce Inverse-Q*, an innovative framework that transcends traditional RL methods by optimizing token-level reinforcement learning without the need for additional reward or value models. Inverse-Q* leverages direct preference optimization techniques but extends them by estimating the conditionally optimal policy directly from the model's responses, facilitating more granular and flexible policy shaping. Our approach reduces reliance on human annotation and external supervision, making it especially suitable for low-resource settings. We present extensive experimental results demonstrating that Inverse-Q* not only matches but potentially exceeds the effectiveness of PPO in terms of convergence speed and the alignment of model responses with human preferences. Our findings suggest that Inverse-Q* offers a practical and robust alternative to conventional RLHF approaches, paving the way for more efficient and adaptable model training approaches.
♻ ☆ IKUN for WMT24 General MT Task: LLMs Are here for Multilingual Machine Translation
This paper introduces two multilingual systems, IKUN and IKUN-C, developed for the general machine translation task in WMT24. IKUN and IKUN-C represent an open system and a constrained system, respectively, built on Llama-3-8b and Mistral-7B-v0.3. Both systems are designed to handle all 11 language directions using a single model. According to automatic evaluation metrics, IKUN-C achieved 6 first-place and 3 second-place finishes among all constrained systems, while IKUN secured 1 first-place and 2 second-place finishes across both open and constrained systems. These encouraging results suggest that large language models (LLMs) are nearing the level of proficiency required for effective multilingual machine translation. The systems are based on a two-stage approach: first, continuous pre-training on monolingual data in 10 languages, followed by fine-tuning on high-quality parallel data for 11 language directions. The primary difference between IKUN and IKUN-C lies in their monolingual pre-training strategy. IKUN-C is pre-trained using constrained monolingual data, whereas IKUN leverages monolingual data from the OSCAR dataset. In the second phase, both systems are fine-tuned on parallel data sourced from NTREX, Flores, and WMT16-23 for all 11 language pairs.
comment: typo: 120K -> 12K vocabulary size
♻ ☆ ReMamba: Equip Mamba with Effective Long-Sequence Modeling
While the Mamba architecture demonstrates superior inference efficiency and competitive performance on short-context natural language processing (NLP) tasks, empirical evidence suggests its capacity to comprehend long contexts is limited compared to transformer-based models. In this study, we investigate the long-context efficiency issues of the Mamba models and propose ReMamba, which enhances Mamba's ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process, incurring minimal additional inference costs overhead. Experimental results on the LongBench and L-Eval benchmarks demonstrate ReMamba's efficacy, improving over the baselines by 3.2 and 1.6 points, respectively, and attaining performance almost on par with same-size transformer models.
♻ ☆ A Preference-driven Paradigm for Enhanced Translation with Large Language Models NAACL 2024
Recent research has shown that large language models (LLMs) can achieve remarkable translation performance through supervised fine-tuning (SFT) using only a small amount of parallel data. However, SFT simply instructs the model to imitate the reference translations at the token level, making it vulnerable to the noise present in the references. Hence, the assistance from SFT often reaches a plateau once the LLMs have achieved a certain level of translation capability, and further increasing the size of parallel data does not provide additional benefits. To overcome this plateau associated with imitation-based SFT, we propose a preference-based approach built upon the Plackett-Luce model. The objective is to steer LLMs towards a more nuanced understanding of translation preferences from a holistic view, while also being more resilient in the absence of gold translations. We further build a dataset named MAPLE to verify the effectiveness of our approach, which includes multiple translations of varying quality for each source sentence. Extensive experiments demonstrate the superiority of our approach in "breaking the plateau" across diverse LLMs and test settings. Our in-depth analysis underscores the pivotal role of diverse translations and accurate preference scores in the success of our approach.
comment: Accepted to NAACL 2024 (long, main)
♻ ☆ TEncDM: Understanding the Properties of Diffusion Model in the Space of Language Model Encodings
This paper presents the Text Encoding Diffusion Model (TEncDM), a novel approach to diffusion modeling that operates in the space of pre-trained language model encodings. In contrast to traditionally used embeddings, encodings integrate contextual information. In our approach, we also employ a transformer-based decoder, specifically designed to incorporate context in the token prediction process. We conduct a comprehensive examination of the influence of the encoder, decoder, noise scheduler, and self-conditioning on zero-shot generation. Furthermore, we compare TEncDM with previous approaches on three conditional text generation tasks: QQP, XSum, and Wiki-Auto. The results show that TEncDM exhibits superior performance compared to existing non-autoregressive diffusion models.
comment: 14 pages, 13 figures
♻ ☆ Helmsman of the Masses? Evaluate the Opinion Leadership of Large Language Models in the Werewolf Game
Large language models (LLMs) have exhibited memorable strategic behaviors in social deductive games. However, the significance of opinion leadership exhibited by LLM-based agents has been largely overlooked, which is crucial for practical applications in multi-agent and human-AI interaction settings. Opinion leaders are individuals who have a noticeable impact on the beliefs and behaviors of others within a social group. In this work, we employ the Werewolf game as a simulation platform to assess the opinion leadership of LLMs. The game includes the role of the Sheriff, tasked with summarizing arguments and recommending decision options, and therefore serves as a credible proxy for an opinion leader. We develop a framework integrating the Sheriff role and devise two novel metrics based on the critical characteristics of opinion leaders. The first metric measures the reliability of the opinion leader, and the second assesses the influence of the opinion leader on other players' decisions. We conduct extensive experiments to evaluate LLMs of different scales. In addition, we collect a Werewolf question-answering dataset (WWQA) to assess and enhance LLM's grasp of the game rules, and we also incorporate human participants for further analysis. The results suggest that the Werewolf game is a suitable test bed to evaluate the opinion leadership of LLMs, and few LLMs possess the capacity for opinion leadership.
comment: Published as a conference paper at COLM 2024. 37 pages, 6 figures, 27 tables
♻ ☆ MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
Scaling the size of a model enhances its capabilities but significantly increases computation complexity. Mixture-of-Experts models (MoE) address the issue by allowing model size to scale up without substantially increasing training or inference costs. In MoE, there is an important module called the router, which is used to distribute each token to the experts. Currently, the mainstream routing methods include dynamic routing and fixed routing. Despite their promising results, MoE models encounter several challenges. Primarily, for dynamic routing methods, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, though fixed routing methods can mitigate that issue, they compromise on the diversity of representations. In this paper, we propose \textbf{MaskMoE}, a method designed to enhance token-level learning by employing a routing \textbf{mask}ing technique within the \textbf{M}ixture-\textbf{o}f-\textbf{E}xperts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
comment: Work in progress
♻ ☆ LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation
Retrieval-Augmented Generation (RAG) has become a primary technique for mitigating hallucinations in large language models (LLMs). However, incomplete knowledge extraction and insufficient understanding can still mislead LLMs to produce irrelevant or even contradictory responses, which means hallucinations persist in RAG. In this paper, we propose LRP4RAG, a method based on the Layer-wise Relevance Propagation (LRP) algorithm for detecting hallucinations in RAG. Specifically, we first utilize LRP to compute the relevance between the input and output of the RAG generator. We then apply further extraction and resampling to the relevance matrix. The processed relevance data are input into multiple classifiers to determine whether the output contains hallucinations. To the best of our knowledge, this is the first time that LRP has been used for detecting RAG hallucinations, and extensive experiments demonstrate that LRP4RAG outperforms existing baselines.
♻ ☆ PsychoGAT: A Novel Psychological Measurement Paradigm through Interactive Fiction Games with LLM Agents ACL 2024
Psychological measurement is essential for mental health, self-understanding, and personal development. Traditional methods, such as self-report scales and psychologist interviews, often face challenges with engagement and accessibility. While game-based and LLM-based tools have been explored to improve user interest and automate assessment, they struggle to balance engagement with generalizability. In this work, we propose PsychoGAT (Psychological Game AgenTs) to achieve a generic gamification of psychological assessment. The main insight is that powerful LLMs can function both as adept psychologists and innovative game designers. By incorporating LLM agents into designated roles and carefully managing their interactions, PsychoGAT can transform any standardized scales into personalized and engaging interactive fiction games. To validate the proposed method, we conduct psychometric evaluations to assess its effectiveness and employ human evaluators to examine the generated content across various psychological constructs, including depression, cognitive distortions, and personality traits. Results demonstrate that PsychoGAT serves as an effective assessment tool, achieving statistically significant excellence in psychometric metrics such as reliability, convergent validity, and discriminant validity. Moreover, human evaluations confirm PsychoGAT's enhancements in content coherence, interactivity, interest, immersion, and satisfaction.
comment: ACL 2024
♻ ☆ Internal Consistency and Self-Feedback in Large Language Models: A Survey
Large language models (LLMs) often exhibit deficient reasoning or generate hallucinations. To address these, studies prefixed with "Self-" such as Self-Consistency, Self-Improve, and Self-Refine have been initiated. They share a commonality: involving LLMs evaluating and updating themselves. Nonetheless, these efforts lack a unified perspective on summarization, as existing surveys predominantly focus on categorization. In this paper, we summarize a theoretical framework, Internal Consistency, offering explanations for reasoning deficiencies and hallucinations. Internal Consistency refers to the consistency in expressions among LLMs' latent, decoding, or response layers based on sampling methodologies. Then, we introduce another effective theoretical framework capable of mining Internal Consistency, named Self-Feedback. This framework consists of two modules: Self-Evaluation and Self-Update. The former captures Internal Consistency Signals, while the latter leverages the signals to enhance either the model's response or the model itself. This framework has been employed in numerous studies. We systematically classify these studies by tasks and lines of work; summarize relevant evaluation methods and benchmarks; and delve into the concern, "Does Self-Feedback Really Work?" We also propose several critical viewpoints, including the "Hourglass Evolution of Internal Consistency", "Consistency Is (Almost) Correctness" hypothesis, and "The Paradox of Latent and Explicit Reasoning". The relevant resources are open-sourced at https://github.com/IAAR-Shanghai/ICSFSurvey.
comment: 24 pages, 9 figures, 7 tables, 14 equations
♻ ☆ InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language Models
The field of emotion recognition of conversation (ERC) has been focusing on separating sentence feature encoding and context modeling, lacking exploration in generative paradigms based on unified designs. In this study, we propose a novel approach, InstructERC, to reformulate the ERC task from a discriminative framework to a generative framework based on Large Language Models (LLMs). InstructERC makes three significant contributions: (1) it introduces a simple yet effective retrieval template module, which helps the model explicitly integrate multi-granularity dialogue supervision information. (2) We introduce two additional emotion alignment tasks, namely speaker identification and emotion prediction tasks, to implicitly model the dialogue role relationships and future emotional tendencies in conversations. (3) Pioneeringly, we unify emotion labels across benchmarks through the feeling wheel to fit real application scenarios. InstructERC still perform impressively on this unified dataset. Our LLM-based plugin framework significantly outperforms all previous models and achieves comprehensive SOTA on three commonly used ERC datasets. Extensive analysis of parameter-efficient and data-scaling experiments provides empirical guidance for applying it in practical scenarios.
♻ ☆ TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models
With the great advancements in large language models (LLMs), adversarial attacks against LLMs have recently attracted increasing attention. We found that pre-existing adversarial attack methodologies exhibit limited transferability and are notably inefficient, particularly when applied to LLMs. In this paper, we analyze the core mechanisms of previous predominant adversarial attack methods, revealing that 1) the distributions of importance score differ markedly among victim models, restricting the transferability; 2) the sequential attack processes induces substantial time overheads. Based on the above two insights, we introduce a new scheme, named TF-Attack, for Transferable and Fast adversarial attacks on LLMs. TF-Attack employs an external LLM as a third-party overseer rather than the victim model to identify critical units within sentences. Moreover, TF-Attack introduces the concept of Importance Level, which allows for parallel substitutions of attacks. We conduct extensive experiments on 6 widely adopted benchmarks, evaluating the proposed method through both automatic and human metrics. Results show that our method consistently surpasses previous methods in transferability and delivers significant speed improvements, up to 20 times faster than earlier attack strategies.
comment: 14 pages, 6 figures. arXiv admin note: text overlap with arXiv:2305.17440 by other authors
♻ ☆ BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model
The rapid advancement of large language models (LLMs) has revolutionized role-playing, enabling the development of general role-playing models. However, current role-playing training has two significant issues: (I) Using a predefined role profile to prompt dialogue training for specific scenarios usually leads to inconsistencies and even conflicts between the dialogue and the profile, resulting in training biases. (II) The model learns to imitate the role based solely on the profile, neglecting profile-dialogue alignment at the sentence level. In this work, we propose a simple yet effective framework called BEYOND DIALOGUE, designed to overcome these hurdles. This framework innovatively introduces "beyond dialogue" tasks to align dialogue with profile traits based on each specific scenario, thereby eliminating biases during training. Furthermore, by adopting an innovative prompting mechanism that generates reasoning outcomes for training, the framework allows the model to achieve fine-grained alignment between profile and dialogue at the sentence level. The aforementioned methods are fully automated and low-cost. Additionally, the integration of automated dialogue and objective evaluation methods forms a comprehensive framework, paving the way for general role-playing. Experimental results demonstrate that our model excels in adhering to and reflecting various dimensions of role profiles, outperforming most proprietary general and specialized role-playing baselines. All code and datasets are available at https://github.com/yuyouyu32/BeyondDialogue.
♻ ☆ GenRec: Generative Sequential Recommendation with Large Language Models
Sequential recommendation is a task to capture hidden user preferences from historical user item interaction data and recommend next items for the user. Significant progress has been made in this domain by leveraging classification based learning methods. Inspired by the recent paradigm of 'pretrain, prompt and predict' in NLP, we consider sequential recommendation as a sequence to sequence generation task and propose a novel model named Generative Recommendation (GenRec). Unlike classification based models that learn explicit user and item representations, GenRec utilizes the sequence modeling capability of Transformer and adopts the masked item prediction objective to effectively learn the hidden bidirectional sequential patterns. Different from existing generative sequential recommendation models, GenRec does not rely on manually designed hard prompts. The input to GenRec is textual user item sequence and the output is top ranked next items. Moreover, GenRec is lightweight and requires only a few hours to train effectively in low-resource settings, making it highly applicable to real-world scenarios and helping to democratize large language models in the sequential recommendation domain. Our extensive experiments have demonstrated that GenRec generalizes on various public real-world datasets and achieves state-of-the-art results. Our experiments also validate the effectiveness of the the proposed masked item prediction objective that improves the model performance by a large margin.
♻ ☆ TF-Attack: Transferable and Fast Adversarial Attacks on Large Language Models
With the great advancements in large language models (LLMs), adversarial attacks against LLMs have recently attracted increasing attention. We found that pre-existing adversarial attack methodologies exhibit limited transferability and are notably inefficient, particularly when applied to LLMs. In this paper, we analyze the core mechanisms of previous predominant adversarial attack methods, revealing that 1) the distributions of importance score differ markedly among victim models, restricting the transferability; 2) the sequential attack processes induces substantial time overheads. Based on the above two insights, we introduce a new scheme, named TF-Attack, for Transferable and Fast adversarial attacks on LLMs. TF-Attack employs an external LLM as a third-party overseer rather than the victim model to identify critical units within sentences. Moreover, TF-Attack introduces the concept of Importance Level, which allows for parallel substitutions of attacks. We conduct extensive experiments on 6 widely adopted benchmarks, evaluating the proposed method through both automatic and human metrics. Results show that our method consistently surpasses previous methods in transferability and delivers significant speed improvements, up to 20 times faster than earlier attack strategies.
comment: 14 pages, 6 figures
♻ ☆ CReMa: Crisis Response through Computational Identification and Matching of Cross-Lingual Requests and Offers Shared on Social Media
During times of crisis, social media platforms play a crucial role in facilitating communication and coordinating resources. In the midst of chaos and uncertainty, communities often rely on these platforms to share urgent pleas for help, extend support, and organize relief efforts. However, the overwhelming volume of conversations during such periods can escalate to unprecedented levels, necessitating the automated identification and matching of requests and offers to streamline relief operations. Additionally, there is a notable absence of studies conducted in multi-lingual settings, despite the fact that any geographical area can have a diverse linguistic population. Therefore, we propose CReMa (Crisis Response Matcher), a systematic approach that integrates textual, temporal, and spatial features to address the challenges of effectively identifying and matching requests and offers on social media platforms during emergencies. Our approach utilizes a crisis-specific pre-trained model and a multi-lingual embedding space. We emulate human decision-making to compute temporal and spatial features and non-linearly weigh the textual features. The results from our experiments are promising, outperforming strong baselines. Additionally, we introduce a novel multi-lingual dataset simulating help-seeking and offering assistance on social media in 16 languages and conduct comprehensive cross-lingual experiments. Furthermore, we analyze a million-scale geotagged global dataset to understand patterns in seeking help and offering assistance on social media. Overall, these contributions advance the field of crisis informatics and provide benchmarks for future research in the area.
comment: \copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
♻ ☆ ECC Analyzer: Extract Trading Signal from Earnings Conference Calls using Large Language Model for Stock Performance Prediction
In the realm of financial analytics, leveraging unstructured data, such as earnings conference calls (ECCs), to forecast stock volatility is a critical challenge that has attracted both academics and investors. While previous studies have used multimodal deep learning-based models to obtain a general view of ECCs for volatility predicting, they often fail to capture detailed, complex information. Our research introduces a novel framework: \textbf{ECC Analyzer}, which utilizes large language models (LLMs) to extract richer, more predictive content from ECCs to aid the model's prediction performance. We use the pre-trained large models to extract textual and audio features from ECCs and implement a hierarchical information extraction strategy to extract more fine-grained information. This strategy first extracts paragraph-level general information by summarizing the text and then extracts fine-grained focus sentences using Retrieval-Augmented Generation (RAG). These features are then fused through multimodal feature fusion to perform volatility prediction. Experimental results demonstrate that our model outperforms traditional analytical benchmarks, confirming the effectiveness of advanced LLM techniques in financial analysis.
comment: 9 pages, 1 figures, 2 tables
♻ ☆ Large Language Multimodal Models for 5-Year Chronic Disease Cohort Prediction Using EHR Data
Chronic diseases such as diabetes are the leading causes of morbidity and mortality worldwide. Numerous research studies have been attempted with various deep learning models in diagnosis. However, most previous studies had certain limitations, including using publicly available datasets (e.g. MIMIC), and imbalanced data. In this study, we collected five-year electronic health records (EHRs) from the Taiwan hospital database, including 1,420,596 clinical notes, 387,392 laboratory test results, and more than 1,505 laboratory test items, focusing on research pre-training large language models. We proposed a novel Large Language Multimodal Models (LLMMs) framework incorporating multimodal data from clinical notes and laboratory test results for the prediction of chronic disease risk. Our method combined a text embedding encoder and multi-head attention layer to learn laboratory test values, utilizing a deep neural network (DNN) module to merge blood features with chronic disease semantics into a latent space. In our experiments, we observe that clinicalBERT and PubMed-BERT, when combined with attention fusion, can achieve an accuracy of 73% in multiclass chronic diseases and diabetes prediction. By transforming laboratory test values into textual descriptions and employing the Flan T-5 model, we achieved a 76% Area Under the ROC Curve (AUROC), demonstrating the effectiveness of leveraging numerical text data for training and inference in language models. This approach significantly improves the accuracy of early-stage diabetes prediction.
♻ ☆ Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models
Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.5). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement to 97% from 79% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base
♻ ☆ Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.
♻ ☆ Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges
Text generation has become more accessible than ever, and the increasing interest in these systems, especially those using large language models, has spurred an increasing number of related publications. We provide a systematic literature review comprising 244 selected papers between 2017 and 2024. This review categorizes works in text generation into five main tasks: open-ended text generation, summarization, translation, paraphrasing, and question answering. For each task, we review their relevant characteristics, sub-tasks, and specific challenges (e.g., missing datasets for multi-document summarization, coherence in story generation, and complex reasoning for question answering). Additionally, we assess current approaches for evaluating text generation systems and ascertain problems with current metrics. Our investigation shows nine prominent challenges common to all tasks and sub-tasks in recent text generation publications: bias, reasoning, hallucinations, misuse, privacy, interpretability, transparency, datasets, and computing. We provide a detailed analysis of these challenges, their potential solutions, and which gaps still require further engagement from the community. This systematic literature review targets two main audiences: early career researchers in natural language processing looking for an overview of the field and promising research directions, as well as experienced researchers seeking a detailed view of tasks, evaluation methodologies, open challenges, and recent mitigation strategies.
comment: 35 pages, 2 figures, 2 tables, Under review
♻ ☆ A global AI community requires language-diverse publishing ICLR
In this provocation, we discuss the English dominance of the AI research community, arguing that the requirement for English language publishing upholds and reinforces broader regimes of extraction in AI. While large language models and machine translation have been celebrated as a way to break down barriers, we regard their use as a symptom of linguistic exclusion of scientists and potential readers. We propose alternative futures for a healthier publishing culture, organized around three themes: administering conferences in the languages of the country in which they are held, instructing peer reviewers not to adjudicate the language appropriateness of papers, and offering opportunities to publish and present in multiple languages. We welcome new translations of this piece. Please contact the authors if you would like to contribute one.
comment: Translations by Tianyu M. Fang (Mandarin Chinese), Michael Hardy (Guarani), Vandana Sarin and Vivek Sarin (Hindi), Roshna Omer Abdulrahman (Soran\^i Kurdish), Gabriel Poesia (Portuguese), and Mat\'ias Grinberg (Spanish). In the proceedings of the Global AI Cultures Workshop at the Twelfth International Conference on Learning Representations (ICLR) 2024, Vienna, Austria, May 7-11, 2024
♻ ☆ MelHuBERT: A simplified HuBERT on Mel spectrograms
Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, significantly hampering the development of self-supervised learning. In an attempt to reduce the computation of training, we revisit the training of HuBERT, a highly successful self-supervised model. We improve and simplify several key components, including the loss function, input representation, and training in multiple stages. Our model, MelHuBERT, is able to achieve favorable performance on phone recognition, speaker identification, and automatic speech recognition against HuBERT, while saving 31.2% of the pre-training time, or equivalently 33.5% MACs per one second speech. The code and pre-trained models are available in https://github.com/nervjack2/MelHuBERT.
comment: ASRU 2023
♻ ☆ Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging as smaller LMs don't perform well universally. This work tries to bridge this gap by proposing a framework to experimentally evaluate small, open LMs in practical settings through measuring semantic correctness of outputs across three practical aspects: task types, application domains and reasoning types, using diverse prompt styles. It also conducts an in-depth comparison of 10 small, open LMs to identify best LM and prompt style depending on specific application requirement using the proposed framework. We also show that if selected appropriately, they can outperform SOTA LLMs like DeepSeek-v2, GPT-4o-mini, Gemini-1.5-Pro, and even compete with GPT-4o.
comment: Submitted to ARR
♻ ☆ Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
Creating music is iterative, requiring varied methods at each stage. However, existing AI music systems fall short in orchestrating multiple subsystems for diverse needs. To address this gap, we introduce Loop Copilot, a novel system that enables users to generate and iteratively refine music through an interactive, multi-round dialogue interface. The system uses a large language model to interpret user intentions and select appropriate AI models for task execution. Each backend model is specialized for a specific task, and their outputs are aggregated to meet the user's requirements. To ensure musical coherence, essential attributes are maintained in a centralized table. We evaluate the effectiveness of the proposed system through semi-structured interviews and questionnaires, highlighting its utility not only in facilitating music creation but also its potential for broader applications.
comment: Source code and demo video are available at \url{https://sites.google.com/view/loop-copilot}
Computer Vision and Pattern Recognition
☆ 3D Whole-body Grasp Synthesis with Directional Controllability
Synthesizing 3D whole-bodies that realistically grasp objects is useful for animation, mixed reality, and robotics. This is challenging, because the hands and body need to look natural w.r.t. each other, the grasped object, as well as the local scene (i.e., a receptacle supporting the object). Only recent work tackles this, with a divide-and-conquer approach; it first generates a "guiding" right-hand grasp, and then searches for bodies that match this. However, the guiding-hand synthesis lacks controllability and receptacle awareness, so it likely has an implausible direction (i.e., a body can't match this without penetrating the receptacle) and needs corrections through major post-processing. Moreover, the body search needs exhaustive sampling and is expensive. These are strong limitations. We tackle these with a novel method called CWGrasp. Our key idea is that performing geometry-based reasoning "early on," instead of "too late," provides rich "control" signals for inference. To this end, CWGrasp first samples a plausible reaching-direction vector (used later for both the arm and hand) from a probabilistic model built via raycasting from the object and collision checking. Then, it generates a reaching body with a desired arm direction, as well as a "guiding" grasping hand with a desired palm direction that complies with the arm's one. Eventually, CWGrasp refines the body to match the "guiding" hand, while plausibly contacting the scene. Notably, generating already-compatible "parts" greatly simplifies the "whole." Moreover, CWGrasp uniquely tackles both right- and left-hand grasps. We evaluate on the GRAB and ReplicaGrasp datasets. CWGrasp outperforms baselines, at lower runtime and budget, while all components help performance. Code and models will be released.
☆ SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. SAM2Point interprets any 3D data as a series of multi-directional videos, and leverages SAM 2 for 3D-space segmentation, without further training or 2D-3D projection. Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, outdoor environments, and raw sparse LiDAR. Demonstrations on multiple 3D datasets, e.g., Objaverse, S3DIS, ScanNet, Semantic3D, and KITTI, highlight the robust generalization capabilities of SAM2Point. To our best knowledge, we present the most faithful implementation of SAM in 3D, which may serve as a starting point for future research in promptable 3D segmentation. Online Demo: https://huggingface.co/spaces/ZiyuG/SAM2Point . Code: https://github.com/ZiyuGuo99/SAM2Point .
comment: Work in progress. Online Demo: https://huggingface.co/spaces/ZiyuG/SAM2Point . Code: https://github.com/ZiyuGuo99/SAM2Point
PromptSmooth: Certifying Robustness of Medical Vision-Language Models via Prompt Learning MICCAI 2024
Medical vision-language models (Med-VLMs) trained on large datasets of medical image-text pairs and later fine-tuned for specific tasks have emerged as a mainstream paradigm in medical image analysis. However, recent studies have highlighted the susceptibility of these Med-VLMs to adversarial attacks, raising concerns about their safety and robustness. Randomized smoothing is a well-known technique for turning any classifier into a model that is certifiably robust to adversarial perturbations. However, this approach requires retraining the Med-VLM-based classifier so that it classifies well under Gaussian noise, which is often infeasible in practice. In this paper, we propose a novel framework called PromptSmooth to achieve efficient certified robustness of Med-VLMs by leveraging the concept of prompt learning. Given any pre-trained Med-VLM, PromptSmooth adapts it to handle Gaussian noise by learning textual prompts in a zero-shot or few-shot manner, achieving a delicate balance between accuracy and robustness, while minimizing the computational overhead. Moreover, PromptSmooth requires only a single model to handle multiple noise levels, which substantially reduces the computational cost compared to traditional methods that rely on training a separate model for each noise level. Comprehensive experiments based on three Med-VLMs and across six downstream datasets of various imaging modalities demonstrate the efficacy of PromptSmooth. Our code and models are available at https://github.com/nhussein/promptsmooth.
comment: Accepted to MICCAI 2024
☆ ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.
comment: Project page: https://liuff19.github.io/ReconX
☆ CSGO: Content-Style Composition in Text-to-Image Generation
The diffusion model has shown exceptional capabilities in controlled image generation, which has further fueled interest in image style transfer. Existing works mainly focus on training free-based methods (e.g., image inversion) due to the scarcity of specific data. In this study, we present a data construction pipeline for content-style-stylized image triplets that generates and automatically cleanses stylized data triplets. Based on this pipeline, we construct a dataset IMAGStyle, the first large-scale style transfer dataset containing 210k image triplets, available for the community to explore and research. Equipped with IMAGStyle, we propose CSGO, a style transfer model based on end-to-end training, which explicitly decouples content and style features employing independent feature injection. The unified CSGO implements image-driven style transfer, text-driven stylized synthesis, and text editing-driven stylized synthesis. Extensive experiments demonstrate the effectiveness of our approach in enhancing style control capabilities in image generation. Additional visualization and access to the source code can be located on the project page: \url{https://csgo-gen.github.io/}.
☆ UV-free Texture Generation with Denoising and Geodesic Heat Diffusions
Seams, distortions, wasted UV space, vertex-duplication, and varying resolution over the surface are the most prominent issues of the standard UV-based texturing of meshes. These issues are particularly acute when automatic UV-unwrapping techniques are used. For this reason, instead of generating textures in automatically generated UV-planes like most state-of-the-art methods, we propose to represent textures as coloured point-clouds whose colours are generated by a denoising diffusion probabilistic model constrained to operate on the surface of 3D objects. Our sampling and resolution agnostic generative model heavily relies on heat diffusion over the surface of the meshes for spatial communication between points. To enable processing of arbitrarily sampled point-cloud textures and ensure long-distance texture consistency we introduce a fast re-sampling of the mesh spectral properties used during the heat diffusion and introduce a novel heat-diffusion-based self-attention mechanism. Our code and pre-trained models are available at github.com/simofoti/UV3-TeD.
☆ OmniRe: Omni Urban Scene Reconstruction
We introduce OmniRe, a holistic approach for efficiently reconstructing high-fidelity dynamic urban scenes from on-device logs. Recent methods for modeling driving sequences using neural radiance fields or Gaussian Splatting have demonstrated the potential of reconstructing challenging dynamic scenes, but often overlook pedestrians and other non-vehicle dynamic actors, hindering a complete pipeline for dynamic urban scene reconstruction. To that end, we propose a comprehensive 3DGS framework for driving scenes, named OmniRe, that allows for accurate, full-length reconstruction of diverse dynamic objects in a driving log. OmniRe builds dynamic neural scene graphs based on Gaussian representations and constructs multiple local canonical spaces that model various dynamic actors, including vehicles, pedestrians, and cyclists, among many others. This capability is unmatched by existing methods. OmniRe allows us to holistically reconstruct different objects present in the scene, subsequently enabling the simulation of reconstructed scenarios with all actors participating in real-time (~60Hz). Extensive evaluations on the Waymo dataset show that our approach outperforms prior state-of-the-art methods quantitatively and qualitatively by a large margin. We believe our work fills a critical gap in driving reconstruction.
comment: See the project page for code, video results and demos: https://ziyc.github.io/omnire/
☆ Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks
Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: \url{https://github.com/Visual-AI/Dissect-OOD-OSR}
comment: Accepted to IJCV, preprint version
☆ VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video. Specifically, for each transformer layer, we learn to skip the computation for a high proportion (e.g., 80\%) of vision tokens, passing them directly to the next layer. This approach significantly enhances model efficiency, achieving approximately \textasciitilde42\% time and \textasciitilde30\% memory savings for the entire training. Moreover, our method reduces the computation in the context and avoid decreasing the vision tokens, thus preserving or even improving performance compared to the vanilla model. We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks in COIN, Ego4D, and Ego-Exo4D datasets.
☆ Prediction-Feedback DETR for Temporal Action Detection
Temporal Action Detection (TAD) is fundamental yet challenging for real-world video applications. Leveraging the unique benefits of transformers, various DETR-based approaches have been adopted in TAD. However, it has recently been identified that the attention collapse in self-attention causes the performance degradation of DETR for TAD. Building upon previous research, this paper newly addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from predictions, indicating a short-cut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which utilizes predictions to restore the collapse and align the cross- and self-attention with predictions. Specifically, we devise novel prediction-feedback objectives using guidance from the relations of the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.
☆ H-SGANet: Hybrid Sparse Graph Attention Network for Deformable Medical Image Registration
The integration of Convolutional Neural Network (ConvNet) and Transformer has emerged as a strong candidate for image registration, leveraging the strengths of both models and a large parameter space. However, this hybrid model, treating brain MRI volumes as grid or sequence structures, faces challenges in accurately representing anatomical connectivity, diverse brain regions, and vital connections contributing to the brain's internal architecture. Concerns also arise regarding the computational expense and GPU memory usage associated with this model. To tackle these issues, a lightweight hybrid sparse graph attention network (H-SGANet) has been developed. This network incorporates a central mechanism, Sparse Graph Attention (SGA), based on a Vision Graph Neural Network (ViG) with predetermined anatomical connections. The SGA module expands the model's receptive field and seamlessly integrates into the network. To further amplify the advantages of the hybrid network, the Separable Self-Attention (SSA) is employed as an enhanced token mixer, integrated with depth-wise convolution to constitute SSAFormer. This strategic integration is designed to more effectively extract long-range dependencies. As a hybrid ConvNet-ViG-Transformer model, H-SGANet offers threefold benefits for volumetric medical image registration. It optimizes fixed and moving images concurrently through a hybrid feature fusion layer and an end-to-end learning framework. Compared to VoxelMorph, a model with a similar parameter count, H-SGANet demonstrates significant performance enhancements of 3.5% and 1.5% in Dice score on the OASIS dataset and LPBA40 dataset, respectively.
☆ One-Shot Learning Meets Depth Diffusion in Multi-Object Videos
Creating editable videos that depict complex interactions between multiple objects in various artistic styles has long been a challenging task in filmmaking. Progress is often hampered by the scarcity of data sets that contain paired text descriptions and corresponding videos that showcase these interactions. This paper introduces a novel depth-conditioning approach that significantly advances this field by enabling the generation of coherent and diverse videos from just a single text-video pair using a pre-trained depth-aware Text-to-Image (T2I) model. Our method fine-tunes the pre-trained model to capture continuous motion by employing custom-designed spatial and temporal attention mechanisms. During inference, we use the DDIM inversion to provide structural guidance for video generation. This innovative technique allows for continuously controllable depth in videos, facilitating the generation of multiobject interactions while maintaining the concept generation and compositional strengths of the original T2I model across various artistic styles, such as photorealism, animation, and impressionism.
☆ GradBias: Unveiling Word Influence on Bias in Text-to-Image Generative Models
Recent progress in Text-to-Image (T2I) generative models has enabled high-quality image generation. As performance and accessibility increase, these models are gaining significant attraction and popularity: ensuring their fairness and safety is a priority to prevent the dissemination and perpetuation of biases. However, existing studies in bias detection focus on closed sets of predefined biases (e.g., gender, ethnicity). In this paper, we propose a general framework to identify, quantify, and explain biases in an open set setting, i.e. without requiring a predefined set. This pipeline leverages a Large Language Model (LLM) to propose biases starting from a set of captions. Next, these captions are used by the target generative model for generating a set of images. Finally, Vision Question Answering (VQA) is leveraged for bias evaluation. We show two variations of this framework: OpenBias and GradBias. OpenBias detects and quantifies biases, while GradBias determines the contribution of individual prompt words on biases. OpenBias effectively detects both well-known and novel biases related to people, objects, and animals and highly aligns with existing closed-set bias detection methods and human judgment. GradBias shows that neutral words can significantly influence biases and it outperforms several baselines, including state-of-the-art foundation models. Code available here: https://github.com/Moreno98/GradBias.
comment: Under review. Code: https://github.com/Moreno98/GradBias
☆ Generic Objects as Pose Probes for Few-Shot View Synthesis
Radiance fields including NeRFs and 3D Gaussians demonstrate great potential in high-fidelity rendering and scene reconstruction, while they require a substantial number of posed images as inputs. COLMAP is frequently employed for preprocessing to estimate poses, while it necessitates a large number of feature matches to operate effectively, and it struggles with scenes characterized by sparse features, large baselines between images, or a limited number of input images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images. Traditional methods often use calibration boards but they are not common in images. We propose a novel idea of utilizing everyday objects, commonly found in both images and real life, as "pose probes". The probe object is automatically segmented by SAM, whose shape is initialized from a cube. We apply a dual-branch volume rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. Specifically, object poses of two views are first estimated by PnP matching in an SDF representation, which serves as initial poses. PnP matching, requiring only a few features, is suitable for feature-sparse scenes. Additional views are incrementally incorporated to refine poses from preceding views. In experiments, PoseProbe achieves state-of-the-art performance in both pose estimation and novel view synthesis across multiple datasets. We demonstrate its effectiveness, particularly in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different objects in a scene yields comparable performance.
☆ PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification
Extracting robust feature representation is critical for object re-identification to accurately identify objects across non-overlapping cameras. Although having a strong representation ability, the Vision Transformer (ViT) tends to overfit on most distinct regions of training data, limiting its generalizability and attention to holistic object features. Meanwhile, due to the structural difference between CNN and ViT, fine-grained strategies that effectively address this issue in CNN do not continue to be successful in ViT. To address this issue, by observing the latent diverse representation hidden behind the multi-head attention, we present PartFormer, an innovative adaptation of ViT designed to overcome the granularity limitations in object Re-ID tasks. The PartFormer integrates a Head Disentangling Block (HDB) that awakens the diverse representation of multi-head self-attention without the typical loss of feature richness induced by concatenation and FFN layers post-attention. To avoid the homogenization of attention heads and promote robust part-based feature learning, two head diversity constraints are imposed: attention diversity constraint and correlation diversity constraint. These constraints enable the model to exploit diverse and discriminative feature representations from different attention heads. Comprehensive experiments on various object Re-ID benchmarks demonstrate the superiority of the PartFormer. Specifically, our framework significantly outperforms state-of-the-art by 2.4\% mAP scores on the most challenging MSMT17 dataset.
☆ Space3D-Bench: Spatial 3D Question Answering Benchmark
Answering questions about the spatial properties of the environment poses challenges for existing language and vision foundation models due to a lack of understanding of the 3D world notably in terms of relationships between objects. To push the field forward, multiple 3D Q&A datasets were proposed which, overall, provide a variety of questions, but they individually focus on particular aspects of 3D reasoning or are limited in terms of data modalities. To address this, we present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset which offers a variety of data modalities: point clouds, posed RGB-D images, navigation meshes and 3D object detections. To ensure that the questions cover a wide range of 3D objectives, we propose an indoor spatial questions taxonomy inspired by geographic information systems and use it to balance the dataset accordingly. Moreover, we provide an assessment system that grades natural language responses based on predefined ground-truth answers by leveraging a Vision Language Model's comprehension of both text and images to compare the responses with ground-truth textual information or relevant visual data. Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval, achieving an accuracy of 67% on the proposed dataset.
☆ Eigen-Cluster VIS: Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency
The performance of Video Instance Segmentation (VIS) methods has improved significantly with the advent of transformer networks. However, these networks often face challenges in training due to the high annotation cost. To address this, unsupervised and weakly-supervised methods have been developed to reduce the dependency on annotations. This work introduces a novel weakly-supervised method called Eigen-cluster VIS that, without requiring any mask annotations, achieves competitive accuracy compared to other VIS approaches. This method is based on two key innovations: a Temporal Eigenvalue Loss (TEL) and a clip-level Quality Cluster Coefficient (QCC). The TEL ensures temporal coherence by leveraging the eigenvalues of the Laplacian matrix derived from graph adjacency matrices. By minimizing the mean absolute error (MAE) between the eigenvalues of adjacent frames, this loss function promotes smooth transitions and stable segmentation boundaries over time, reducing temporal discontinuities and improving overall segmentation quality. The QCC employs the K-means method to ensure the quality of spatio-temporal clusters without relying on ground truth masks. Using the Davies-Bouldin score, the QCC provides an unsupervised measure of feature discrimination, allowing the model to self-evaluate and adapt to varying object distributions, enhancing robustness during the testing phase. These enhancements are computationally efficient and straightforward, offering significant performance gains without additional annotated data. The proposed Eigen-Cluster VIS method is evaluated on the YouTube-VIS 2019/2021 and OVIS datasets, demonstrating that it effectively narrows the performance gap between the fully-supervised and weakly-supervised VIS approaches. The code is available on: https://github.com/farnooshar/EigenClusterVIS
comment: 12 pages, 6 Figures, 5 tabels
☆ DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving
The advancement of autonomous driving technologies necessitates increasingly sophisticated methods for understanding and predicting real-world scenarios. Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. In this paper, we propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them. To achieve this, we employ a video generation framework grounded in denoising diffusion probabilistic models (DDPM) aimed at predicting real-world video sequences. We then explore the adequacy of our generated videos for use in VLMs by employing a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV). The diffusion model is trained with the Waymo open dataset and evaluated using the Fr\'echet Video Distance (FVD) score to ensure the quality and realism of the generated videos. Corresponding narrations are provided by EILEV for these generated videos, which may be beneficial in the autonomous driving domain. These narrations can enhance traffic scene understanding, aid in navigation, and improve planning capabilities. The integration of video generation with VLMs in the DriveGenVLM framework represents a significant step forward in leveraging advanced AI models to address complex challenges in autonomous driving.
☆ SODAWideNet++: Combining Attention and Convolutions for Salient Object Detection ICPR 2024
Salient Object Detection (SOD) has traditionally relied on feature refinement modules that utilize the features of an ImageNet pre-trained backbone. However, this approach limits the possibility of pre-training the entire network because of the distinct nature of SOD and image classification. Additionally, the architecture of these backbones originally built for Image classification is sub-optimal for a dense prediction task like SOD. To address these issues, we propose a novel encoder-decoder-style neural network called SODAWideNet++ that is designed explicitly for SOD. Inspired by the vision transformers ability to attain a global receptive field from the initial stages, we introduce the Attention Guided Long Range Feature Extraction (AGLRFE) module, which combines large dilated convolutions and self-attention. Specifically, we use attention features to guide long-range information extracted by multiple dilated convolutions, thus taking advantage of the inductive biases of a convolution operation and the input dependency brought by self-attention. In contrast to the current paradigm of ImageNet pre-training, we modify 118K annotated images from the COCO semantic segmentation dataset by binarizing the annotations to pre-train the proposed model end-to-end. Further, we supervise the background predictions along with the foreground to push our model to generate accurate saliency predictions. SODAWideNet++ performs competitively on five different datasets while only containing 35% of the trainable parameters compared to the state-of-the-art models. The code and pre-computed saliency maps are provided at https://github.com/VimsLab/SODAWideNetPlusPlus.
comment: Accepted at ICPR 2024
☆ 3D Pose-Based Temporal Action Segmentation for Figure Skating: A Fine-Grained and Jump Procedure-Aware Annotation Approach
Understanding human actions from videos is essential in many domains, including sports. In figure skating, technical judgments are performed by watching skaters' 3D movements, and its part of the judging procedure can be regarded as a Temporal Action Segmentation (TAS) task. TAS tasks in figure skating that automatically assign temporal semantics to video are actively researched. However, there is a lack of datasets and effective methods for TAS tasks requiring 3D pose data. In this study, we first created the FS-Jump3D dataset of complex and dynamic figure skating jumps using optical markerless motion capture. We also propose a new fine-grained figure skating jump TAS dataset annotation method with which TAS models can learn jump procedures. In the experimental results, we validated the usefulness of 3D pose features as input and the fine-grained dataset for the TAS model in figure skating. FS-Jump3D Dataset is available at https://github.com/ryota-skating/FS-Jump3D.
comment: 10 pages, 7th ACM International Workshop on Multimedia Content Analysis in Sports
☆ Turbulence Strength $C_n^2$ Estimation from Video using Physics-based Deep Learning
Images captured from a long distance suffer from dynamic image distortion due to turbulent flow of air cells with random temperatures, and thus refractive indices. This phenomenon, known as image dancing, is commonly characterized by its refractive-index structure constant $C_n^2$ as a measure of the turbulence strength. For many applications such as atmospheric forecast model, long-range/astronomy imaging, and aviation safety, optical communication technology, $C_n^2$ estimation is critical for accurately sensing the turbulent environment. Previous methods for $C_n^2$ estimation include estimation from meteorological data (temperature, relative humidity, wind shear, etc.) for single-point measurements, two-ended pathlength measurements from optical scintillometer for path-averaged $C_n^2$, and more recently estimating $C_n^2$ from passive video cameras for low cost and hardware complexity. In this paper, we present a comparative analysis of classical image gradient methods for $C_n^2$ estimation and modern deep learning-based methods leveraging convolutional neural networks. To enable this, we collect a dataset of video capture along with reference scintillometer measurements for ground truth, and we release this unique dataset to the scientific community. We observe that deep learning methods can achieve higher accuracy when trained on similar data, but suffer from generalization errors to other, unseen imagery as compared to classical methods. To overcome this trade-off, we present a novel physics-based network architecture that combines learned convolutional layers with a differentiable image gradient method that maintains high accuracy while being generalizable across image datasets.
comment: Code Available: https://github.com/Riponcs/Cn2Estimation
☆ Sparse Signal Reconstruction for Overdispersed Low-photon Count Biomedical Imaging Using $\ell_p$ Total Variation
The negative binomial model, which generalizes the Poisson distribution model, can be found in applications involving low-photon signal recovery, including medical imaging. Recent studies have explored several regularization terms for the negative binomial model, such as the $\ell_p$ quasi-norm with $0 < p < 1$, $\ell_1$ norm, and the total variation (TV) quasi-seminorm for promoting sparsity in signal recovery. These penalty terms have been shown to improve image reconstruction outcomes. In this paper, we investigate the $\ell_p$ quasi-seminorm, both isotropic and anisotropic $\ell_p$ TV quasi-seminorms, within the framework of the negative binomial statistical model. This problem can be formulated as an optimization problem, which we solve using a gradient-based approach. We present comparisons between the negative binomial and Poisson statistical models using the $\ell_p$ TV quasi-seminorm as well as common penalty terms. Our experimental results highlight the efficacy of the proposed method.
comment: 5 pages, Accepted by the IEEE International Symposium on Biomedical Imaging (ISBI)
☆ Towards Infusing Auxiliary Knowledge for Distracted Driver Detection KDD
Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates the scene graphs, and driver pose information with the visual cues in video frames to create a holistic representation of the driver's actions.Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information.
comment: Accepted at KiL 2024: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference
☆ FastForensics: Efficient Two-Stream Design for Real-Time Image Manipulation Detection BMVC 2024
With the rise in popularity of portable devices, the spread of falsified media on social platforms has become rampant. This necessitates the timely identification of authentic content. However, most advanced detection methods are computationally heavy, hindering their real-time application. In this paper, we describe an efficient two-stream architecture for real-time image manipulation detection. Our method consists of two-stream branches targeting the cognitive and inspective perspectives. In the cognitive branch, we propose efficient wavelet-guided Transformer blocks to capture the global manipulation traces related to frequency. This block contains an interactive wavelet-guided self-attention module that integrates wavelet transformation with efficient attention design, interacting with the knowledge from the inspective branch. The inspective branch consists of simple convolutions that capture fine-grained traces and interact bidirectionally with Transformer blocks to provide mutual support. Our method is lightweight ($\sim$ 8M) but achieves competitive performance compared to many other counterparts, demonstrating its efficacy in image manipulation detection and its potential for portable integration.
comment: BMVC 2024
☆ MST-KD: Multiple Specialized Teachers Knowledge Distillation for Fair Face Recognition ECCV 2024
As in school, one teacher to cover all subjects is insufficient to distill equally robust information to a student. Hence, each subject is taught by a highly specialised teacher. Following a similar philosophy, we propose a multiple specialized teacher framework to distill knowledge to a student network. In our approach, directed at face recognition use cases, we train four teachers on one specific ethnicity, leading to four highly specialized and biased teachers. Our strategy learns a project of these four teachers into a common space and distill that information to a student network. Our results highlighted increased performance and reduced bias for all our experiments. In addition, we further show that having biased/specialized teachers is crucial by showing that our approach achieves better results than when knowledge is distilled from four teachers trained on balanced datasets. Our approach represents a step forward to the understanding of the importance of ethnicity-specific features.
comment: Accepted at ECCV 2024 ABAW
☆ OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation ECCV2024
Category-level articulated object pose estimation focuses on the pose estimation of unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to solve this task. Our model consistently generates reconstruction with a canonical pose and joint state for the entire input object, and it estimates object-level poses that reduce overall pose variance and part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach significantly outperforms previous self-supervised methods and is comparable to the state-of-the-art supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a new real-world articulated object benchmark dataset.
comment: to be published in ECCV2024
☆ Spurfies: Sparse Surface Reconstruction using Local Geometry Priors
We introduce Spurfies, a novel method for sparse-view surface reconstruction that disentangles appearance and geometry information to utilize local geometry priors trained on synthetic data. Recent research heavily focuses on 3D reconstruction using dense multi-view setups, typically requiring hundreds of images. However, these methods often struggle with few-view scenarios. Existing sparse-view reconstruction techniques often rely on multi-view stereo networks that need to learn joint priors for geometry and appearance from a large amount of data. In contrast, we introduce a neural point representation that disentangles geometry and appearance to train a local geometry prior using a subset of the synthetic ShapeNet dataset only. During inference, we utilize this surface prior as additional constraint for surface and appearance reconstruction from sparse input views via differentiable volume rendering, restricting the space of possible solutions. We validate the effectiveness of our method on the DTU dataset and demonstrate that it outperforms previous state of the art by 35% in surface quality while achieving competitive novel view synthesis quality. Moreover, in contrast to previous works, our method can be applied to larger, unbounded scenes, such as Mip-NeRF 360.
comment: https://geometric-rl.mpi-inf.mpg.de/spurfies/
☆ GRPose: Learning Graph Relations for Human Image Generation with Pose Priors
Recent methods using diffusion models have made significant progress in human image generation with various additional controls such as pose priors. However, existing approaches still struggle to generate high-quality images with consistent pose alignment, resulting in unsatisfactory outputs. In this paper, we propose a framework delving into the graph relations of pose priors to provide control information for human image generation. The main idea is to establish a graph topological structure between the pose priors and latent representation of diffusion models to capture the intrinsic associations between different pose parts. A Progressive Graph Integrator (PGI) is designed to learn the spatial relationships of the pose priors with the graph structure, adopting a hierarchical strategy within an Adapter to gradually propagate information across different pose parts. A pose perception loss is further introduced based on a pretrained pose estimation network to minimize the pose differences. Extensive qualitative and quantitative experiments conducted on the Human-Art and LAION-Human datasets demonstrate that our model achieves superior performance, with a 9.98% increase in pose average precision compared to the latest benchmark model. The code is released on *******.
comment: The code will be released at https://github.com/XiangchenYin/GRPose
☆ Towards Modality-agnostic Label-efficient Segmentation with Entropy-Regularized Distribution Alignment
Label-efficient segmentation aims to perform effective segmentation on input data using only sparse and limited ground-truth labels for training. This topic is widely studied in 3D point cloud segmentation due to the difficulty of annotating point clouds densely, while it is also essential for cost-effective segmentation on 2D images. Until recently, pseudo-labels have been widely employed to facilitate training with limited ground-truth labels, and promising progress has been witnessed in both the 2D and 3D segmentation. However, existing pseudo-labeling approaches could suffer heavily from the noises and variations in unlabelled data, which would result in significant discrepancies between generated pseudo-labels and current model predictions during training. We analyze that this can further confuse and affect the model learning process, which shows to be a shared problem in label-efficient learning across both 2D and 3D modalities. To address this issue, we propose a novel learning strategy to regularize the pseudo-labels generated for training, thus effectively narrowing the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for label-efficient learning, resulting in an ERDA learning strategy. Interestingly, by using KL distance to formulate the distribution alignment loss, ERDA reduces to a deceptively simple cross-entropy-based loss which optimizes both the pseudo-label generation module and the segmentation model simultaneously. In addition, we innovate in the pseudo-label generation to make our ERDA consistently effective across both 2D and 3D data modalities for segmentation. Enjoying simplicity and more modality-agnostic pseudo-label generation, our method has shown outstanding performance in fully utilizing all unlabeled data points for training across ...
comment: Extended version of arXiv:2305.15832; Code at https://github.com/LiyaoTang/ERDA
☆ Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation ICML 2024
Character animation is a transformative field in computer graphics and vision, enabling dynamic and realistic video animations from static images. Despite advancements, maintaining appearance consistency in animations remains a challenge. Our approach addresses this by introducing a training-free framework that ensures the generated video sequence preserves the reference image's subtleties, such as physique and proportions, through a dual alignment strategy. We decouple skeletal and motion priors from pose information, enabling precise control over animation generation. Our method also improves pixel-level alignment for conditional control from the reference character, enhancing the temporal consistency and visual cohesion of animations. Our method significantly enhances the quality of video generation without the need for large datasets or expensive computational resources.
comment: CVG@ICML 2024
☆ A Simple and Generalist Approach for Panoptic Segmentation
Generalist vision models aim for one and the same architecture for a variety of vision tasks. While such shared architecture may seem attractive, generalist models tend to be outperformed by their bespoken counterparts, especially in the case of panoptic segmentation. We address this problem by introducing two key contributions, without compromising the desirable properties of generalist models. These contributions are: (i) a positional-embedding (PE) based loss for improved centroid regressions; (ii) Edge Distance Sampling (EDS) for the better separation of instance boundaries. The PE-based loss facilitates a better per-pixel regression of the associated instance's centroid, whereas EDS contributes by carefully handling the void regions (caused by missing labels) and smaller instances. These two simple yet effective modifications significantly improve established baselines, while achieving state-of-the-art results among all generalist solutions. More specifically, our method achieves a panoptic quality(PQ) of 52.5 on the COCO dataset, which is an improvement of 10 points over the best model with similar approach (Painter), and is superior by 2 to the best performing diffusion-based method Pix2Seq-$\mathcal{D}$. Furthermore, we provide insights into and an in-depth analysis of our contributions through exhaustive experiments. Our source code and model weights will be made publicly available.
☆ Locally Grouped and Scale-Guided Attention for Dense Pest Counting
This study introduces a new dense pest counting problem to predict densely distributed pests captured by digital traps. Unlike traditional detection-based counting models for sparsely distributed objects, trap-based pest counting must deal with dense pest distributions that pose challenges such as severe occlusion, wide pose variation, and similar appearances in colors and textures. To address these problems, it is essential to incorporate the local attention mechanism, which identifies locally important and unimportant areas to learn locally grouped features, thereby enhancing discriminative performance. Accordingly, this study presents a novel design that integrates locally grouped and scale-guided attention into a multiscale CenterNet framework. To group local features with similar attributes, a straightforward method is introduced using the heatmap predicted by the first hourglass containing pest centroid information, which eliminates the need for complex clustering models. To enhance attentiveness, the pixel attention module transforms the heatmap into a learnable map. Subsequently, scale-guided attention is deployed to make the object and background features more discriminative, achieving multiscale feature fusion. Through experiments, the proposed model is verified to enhance object features based on local grouping and discriminative feature attention learning. Additionally, the proposed model is highly effective in overcoming occlusion and pose variation problems, making it more suitable for dense pest counting. In particular, the proposed model outperforms state-of-the-art models by a large margin, with a remarkable contribution to dense pest counting.
☆ UAV-Based Human Body Detector Selection and Fusion for Geolocated Saliency Map Generation
The problem of reliably detecting and geolocating objects of different classes in soft real-time is essential in many application areas, such as Search and Rescue performed using Unmanned Aerial Vehicles (UAVs). This research addresses the complementary problems of system contextual vision-based detector selection, allocation, and execution, in addition to the fusion of detection results from teams of UAVs for the purpose of accurately and reliably geolocating objects of interest in a timely manner. In an offline step, an application-independent evaluation of vision-based detectors from a system perspective is first performed. Based on this evaluation, the most appropriate algorithms for online object detection for each platform are selected automatically before a mission, taking into account a number of practical system considerations, such as the available communication links, video compression used, and the available computational resources. The detection results are fused using a method for building maps of salient locations which takes advantage of a novel sensor model for vision-based detections for both positive and negative observations. A number of simulated and real flight experiments are also presented, validating the proposed method.
comment: 42 pages, 19 figures
☆ CogVLM2: Visual Language Models for Image and Video Understanding
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
☆ Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning
Adapting pre-trained models to open classes is a challenging problem in machine learning. Vision-language models fully explore the knowledge of text modality, demonstrating strong zero-shot recognition performance, which is naturally suited for various open-set problems. More recently, some research focuses on fine-tuning such models to downstream tasks. Prompt tuning methods achieved huge improvements by learning context vectors on few-shot data. However, through the evaluation under open-set adaptation setting with the test data including new classes, we find that there exists a dilemma that learned prompts have worse generalization abilities than hand-crafted prompts. In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach, which leverages the maximum concept matching (MCM) scores as dynamic weights to generate an input-conditioned prompt for each image during test. Through extensive experiments on 11 different datasets, we show that our proposed method outperforms all comparison methods on average considering both base and new classes. The code is available at https://github.com/gaozhengqing/TTPT
comment: PRCV 2024
☆ A Deep-Learning-Based Lable-free No-Reference Image Quality Assessment Metric: Application in Sodium MRI Denoising
New multinuclear MRI techniques, such as sodium MRI, generally suffer from low image quality due to an inherently low signal. Postprocessing methods, such as image denoising, have been developed for image enhancement. However, the assessment of these enhanced images is challenging especially considering when there is a lack of high resolution and high signal images as reference, such as in sodium MRI. No-reference Image Quality Assessment (NR-IQA) metrics are approaches to solve this problem. Existing learning-based NR-IQA metrics rely on labels derived from subjective human opinions or metrics like Signal-to-Noise Ratio (SNR), which are either time-consuming or lack accurate ground truths, resulting in unreliable assessment. We note that deep learning (DL) models have a unique characteristic in that they are specialized to a characteristic training set, meaning that deviations between the input testing data from the training data will reduce prediction accuracy. Therefore, we propose a novel DL-based NR-IQA metric, the Model Specialization Metric (MSM), which does not depend on ground-truth images or labels. MSM measures the difference between the input image and the model's prediction for evaluating the quality of the input image. Experiments conducted on both simulated distorted proton T1-weighted MR images and denoised sodium MR images demonstrate that MSM exhibits a superior evaluation performance on various simulated noises and distortions. MSM also has a substantial agreement with the expert evaluations, achieving an averaged Cohen's Kappa coefficient of 0.6528, outperforming the existing NR-IQA metrics.
comment: 13 pages, 3 figures
☆ MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation
Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain, e.g., synthetic data, and an unlabeled target domain. We observe that current UDA methods show inferior results on fine structures and tend to oversegment objects with ambiguous appearance. To address these shortcomings, we propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries. We show that naively incorporating depth into current UDA methods does not fully exploit the potential of this complementary information. To this end, we present MICDrop, which learns a joint feature representation by masking image encoder features while inversely masking depth encoder features. With this simple yet effective complementary masking strategy, we enforce the use of both modalities when learning the joint feature representation. To aid this process, we propose a feature fusion module to improve both global as well as local information sharing while being robust to errors in the depth predictions. We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks, obtaining new state-of-the-art performances.
☆ Creating a Segmented Pointcloud of Grapevines by Combining Multiple Viewpoints Through Visual Odometry
Grapevine winter pruning is a labor-intensive and repetitive process that significantly influences the quality and quantity of the grape harvest and produced wine of the following season. It requires a careful and expert detection of the point to be cut. Because of its complexity, repetitive nature and time constraint, the task requires skilled labor that needs to be trained. This extended abstract presents the computer vision pipeline employed in project Vinum, using detectron2 as a segmentation network and keypoint visual odometry to merge different observation into a single pointcloud used to make informed pruning decisions.
☆ Improving 3D deep learning segmentation with biophysically motivated cell synthesis
Biomedical research increasingly relies on 3D cell culture models and AI-based analysis can potentially facilitate a detailed and accurate feature extraction on a single-cell level. However, this requires for a precise segmentation of 3D cell datasets, which in turn demands high-quality ground truth for training. Manual annotation, the gold standard for ground truth data, is too time-consuming and thus not feasible for the generation of large 3D training datasets. To address this, we present a novel framework for generating 3D training data, which integrates biophysical modeling for realistic cell shape and alignment. Our approach allows the in silico generation of coherent membrane and nuclei signals, that enable the training of segmentation models utilizing both channels for improved performance. Furthermore, we present a new GAN training scheme that generates not only image data but also matching labels. Quantitative evaluation shows superior performance of biophysical motivated synthetic training data, even outperforming manual annotation and pretrained models. This underscores the potential of incorporating biophysical modeling for enhancing synthetic training data quality.
☆ Multi-source Domain Adaptation for Panoramic Semantic Segmentation
Panoramic semantic segmentation has received widespread attention recently due to its comprehensive 360\degree field of view. However, labeling such images demands greater resources compared to pinhole images. As a result, many unsupervised domain adaptation methods for panoramic semantic segmentation have emerged, utilizing real pinhole images or low-cost synthetic panoramic images. But, the segmentation model lacks understanding of the panoramic structure when only utilizing real pinhole images, and it lacks perception of real-world scenes when only adopting synthetic panoramic images. Therefore, in this paper, we propose a new task of multi-source domain adaptation for panoramic semantic segmentation, aiming to utilize both real pinhole and synthetic panoramic images in the source domains, enabling the segmentation model to perform well on unlabeled real panoramic images in the target domain. Further, we propose Deformation Transform Aligner for Panoramic Semantic Segmentation (DTA4PASS), which converts all pinhole images in the source domains into panoramic-like images, and then aligns the converted source domains with the target domain. Specifically, DTA4PASS consists of two main components: Unpaired Semantic Morphing (USM) and Distortion Gating Alignment (DGA). Firstly, in USM, the Semantic Dual-view Discriminator (SDD) assists in training the diffeomorphic deformation network, enabling the effective transformation of pinhole images without paired panoramic views. Secondly, DGA assigns pinhole-like and panoramic-like features to each image by gating, and aligns these two features through uncertainty estimation. DTA4PASS outperforms the previous state-of-the-art methods by 1.92% and 2.19% on the outdoor and indoor multi-source domain adaptation scenarios, respectively. The source code will be released.
comment: 9 pages, 7 figures, 5 tables
☆ Spiking Diffusion Models
Recent years have witnessed Spiking Neural Networks (SNNs) gaining attention for their ultra-low energy consumption and high biological plausibility compared with traditional Artificial Neural Networks (ANNs). Despite their distinguished properties, the application of SNNs in the computationally intensive field of image generation is still under exploration. In this paper, we propose the Spiking Diffusion Models (SDMs), an innovative family of SNN-based generative models that excel in producing high-quality samples with significantly reduced energy consumption. In particular, we propose a Temporal-wise Spiking Mechanism (TSM) that allows SNNs to capture more temporal features from a bio-plasticity perspective. In addition, we propose a threshold-guided strategy that can further improve the performances by up to 16.7% without any additional training. We also make the first attempt to use the ANN-SNN approach for SNN-based generation tasks. Extensive experimental results reveal that our approach not only exhibits comparable performance to its ANN counterpart with few spiking time steps, but also outperforms previous SNN-based generative models by a large margin. Moreover, we also demonstrate the high-quality generation ability of SDM on large-scale datasets, e.g., LSUN bedroom. This development marks a pivotal advancement in the capabilities of SNN-based generation, paving the way for future research avenues to realize low-energy and low-latency generative applications. Our code is available at https://github.com/AndyCao1125/SDM.
comment: Accepted by IEEE Transactions on Artificial Intelligence
☆ Weakly Supervised Object Detection for Automatic Tooth-marked Tongue Recognition
Tongue diagnosis in Traditional Chinese Medicine (TCM) is a crucial diagnostic method that can reflect an individual's health status. Traditional methods for identifying tooth-marked tongues are subjective and inconsistent because they rely on practitioner experience. We propose a novel fully automated Weakly Supervised method using Vision transformer and Multiple instance learning WSVM for tongue extraction and tooth-marked tongue recognition. Our approach first accurately detects and extracts the tongue region from clinical images, removing any irrelevant background information. Then, we implement an end-to-end weakly supervised object detection method. We utilize Vision Transformer (ViT) to process tongue images in patches and employ multiple instance loss to identify tooth-marked regions with only image-level annotations. WSVM achieves high accuracy in tooth-marked tongue classification, and visualization experiments demonstrate its effectiveness in pinpointing these regions. This automated approach enhances the objectivity and accuracy of tooth-marked tongue diagnosis. It provides significant clinical value by assisting TCM practitioners in making precise diagnoses and treatment recommendations. Code is available at https://github.com/yc-zh/WSVM.
☆ What to Preserve and What to Transfer: Faithful, Identity-Preserving Diffusion-based Hairstyle Transfer
Hairstyle transfer is a challenging task in the image editing field that modifies the hairstyle of a given face image while preserving its other appearance and background features. The existing hairstyle transfer approaches heavily rely on StyleGAN, which is pre-trained on cropped and aligned face images. Hence, they struggle to generalize under challenging conditions such as extreme variations of head poses or focal lengths. To address this issue, we propose a one-stage hairstyle transfer diffusion model, HairFusion, that applies to real-world scenarios. Specifically, we carefully design a hair-agnostic representation as the input of the model, where the original hair information is thoroughly eliminated. Next, we introduce a hair align cross-attention (Align-CA) to accurately align the reference hairstyle with the face image while considering the difference in their face shape. To enhance the preservation of the face image's original features, we leverage adaptive hair blending during the inference, where the output's hair regions are estimated by the cross-attention map in Align-CA and blended with non-hair areas of the face image. Our experimental results show that our method achieves state-of-the-art performance compared to the existing methods in preserving the integrity of both the transferred hairstyle and the surrounding features. The codes are available at https://github.com/cychungg/HairFusion.
☆ Enhancing Sound Source Localization via False Negative Elimination
Sound source localization aims to localize objects emitting the sound in visual scenes. Recent works obtaining impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior arts can lead to the false negative issue, where the sounds semantically similar to visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. As a result, this misalignment of audio and visual features could yield inferior performance. To address this issue, we propose a novel audio-visual learning framework which is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, while a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard SSPL acts as a negative-free method to eliminate false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing reliable visual anchor and audio negatives for contrast. Different from SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative to achieve the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state-of-the-arts. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks. Code and models are available at: https://github.com/zjsong/SACL.
comment: arXiv admin note: substantial text overlap with arXiv:2203.13412
☆ Mismatched: Evaluating the Limits of Image Matching Approaches and Benchmarks
Three-dimensional (3D) reconstruction from two-dimensional images is an active research field in computer vision, with applications ranging from navigation and object tracking to segmentation and three-dimensional modeling. Traditionally, parametric techniques have been employed for this task. However, recent advancements have seen a shift towards learning-based methods. Given the rapid pace of research and the frequent introduction of new image matching methods, it is essential to evaluate them. In this paper, we present a comprehensive evaluation of various image matching methods using a structure-from-motion pipeline. We assess the performance of these methods on both in-domain and out-of-domain datasets, identifying key limitations in both the methods and benchmarks. We also investigate the impact of edge detection as a pre-processing step. Our analysis reveals that image matching for 3D reconstruction remains an open challenge, necessitating careful selection and tuning of models for specific scenarios, while also highlighting mismatches in how metrics currently represent method performance.
comment: 19 pages, 5 figures
☆ Integrating Features for Recognizing Human Activities through Optimized Parameters in Graph Convolutional Networks and Transformer Architectures
Human activity recognition is a major field of study that employs computer vision, machine vision, and deep learning techniques to categorize human actions. The field of deep learning has made significant progress, with architectures that are extremely effective at capturing human dynamics. This study emphasizes the influence of feature fusion on the accuracy of activity recognition. This technique addresses the limitation of conventional models, which face difficulties in identifying activities because of their limited capacity to understand spatial and temporal features. The technique employs sensory data obtained from four publicly available datasets: HuGaDB, PKU-MMD, LARa, and TUG. The accuracy and F1-score of two deep learning models, specifically a Transformer model and a Parameter-Optimized Graph Convolutional Network (PO-GCN), were evaluated using these datasets. The feature fusion technique integrated the final layer features from both models and inputted them into a classifier. Empirical evidence demonstrates that PO-GCN outperforms standard models in activity recognition. HuGaDB demonstrated a 2.3% improvement in accuracy and a 2.2% increase in F1-score. TUG showed a 5% increase in accuracy and a 0.5% rise in F1-score. On the other hand, LARa and PKU-MMD achieved lower accuracies of 64% and 69% respectively. This indicates that the integration of features enhanced the performance of both the Transformer model and PO-GCN.
comment: 6 pages, 1 figure, conference
☆ Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS
Video object segmentation (VOS) is a crucial task in computer vision, but current VOS methods struggle with complex scenes and prolonged object motions. To address these challenges, the MOSE dataset aims to enhance object recognition and differentiation in complex environments, while the LVOS dataset focuses on segmenting objects exhibiting long-term, intricate movements. This report introduces a discriminative spatial-temporal VOS model that utilizes discriminative object features as query representations. The semantic understanding of spatial-semantic modules enables it to recognize object parts, while salient features highlight more distinctive object characteristics. Our model, trained on extensive VOS datasets, achieved first place (\textbf{80.90\%} $\mathcal{J \& F}$) on the test set of the 6th LSVOS challenge in the VOS Track, demonstrating its effectiveness in tackling the aforementioned challenges. The code will be available at \href{https://github.com/yahooo-m/VOS-Solution}{code}.
comment: 1st Place Solution for 6th LSVOS VOS Track. arXiv admin note: substantial text overlap with arXiv:2406.04600
☆ COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation ECCV 2024
Estimating global human motion from moving cameras is challenging due to the entanglement of human and camera motions. To mitigate the ambiguity, existing methods leverage learned human motion priors, which however often result in oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained control to disentangle human and camera motions. Although pre-trained motion diffusion models encode rich motion priors, we find it non-trivial to leverage such knowledge to guide global motion estimation from RGB videos. COIN introduces a novel control-inpainting score distillation sampling method to ensure well-aligned, consistent, and high-quality motion from the diffusion prior within a joint optimization framework. Furthermore, we introduce a new human-scene relation loss to alleviate the scale ambiguity by enforcing consistency among the humans, camera, and scene. Experiments on three challenging benchmarks demonstrate the effectiveness of COIN, which outperforms the state-of-the-art methods in terms of global human motion estimation and camera motion estimation. As an illustrative example, COIN outperforms the state-of-the-art method by 33% in world joint position error (W-MPJPE) on the RICH dataset.
comment: ECCV 2024
☆ Text-Enhanced Zero-Shot Action Recognition: A training-free approach ICPR 2024
Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While these models excel in zero-shot image tasks, their application to zero-shot video action recognition (ZSVAR) remains challenging due to the dynamic and temporal nature of actions. Existing methods for ZS-VAR typically require extensive training on specific datasets, which can be resource-intensive and may introduce domain biases. In this work, we propose Text-Enhanced Action Recognition (TEAR), a simple approach to ZS-VAR that is training-free and does not require the availability of training data or extensive computational resources. Drawing inspiration from recent findings in vision and language literature, we utilize action descriptors for decomposition and contextual information to enhance zero-shot action recognition. Through experiments on UCF101, HMDB51, and Kinetics-600 datasets, we showcase the effectiveness and applicability of our proposed approach in addressing the challenges of ZS-VAR.
comment: accepted to ICPR 2024
☆ IBO: Inpainting-Based Occlusion to Enhance Explainable Artificial Intelligence Evaluation in Histopathology
Histopathological image analysis is crucial for accurate cancer diagnosis and treatment planning. While deep learning models, especially convolutional neural networks, have advanced this field, their "black-box" nature raises concerns about interpretability and trustworthiness. Explainable Artificial Intelligence (XAI) techniques aim to address these concerns, but evaluating their effectiveness remains challenging. A significant issue with current occlusion-based XAI methods is that they often generate Out-of-Distribution (OoD) samples, leading to inaccurate evaluations. In this paper, we introduce Inpainting-Based Occlusion (IBO), a novel occlusion strategy that utilizes a Denoising Diffusion Probabilistic Model to inpaint occluded regions in histopathological images. By replacing cancerous areas with realistic, non-cancerous tissue, IBO minimizes OoD artifacts and preserves data integrity. We evaluate our method on the CAMELYON16 dataset through two phases: first, by assessing perceptual similarity using the Learned Perceptual Image Patch Similarity (LPIPS) metric, and second, by quantifying the impact on model predictions through Area Under the Curve (AUC) analysis. Our results demonstrate that IBO significantly improves perceptual fidelity, achieving nearly twice the improvement in LPIPS scores compared to the best existing occlusion strategy. Additionally, IBO increased the precision of XAI performance prediction from 42% to 71% compared to traditional methods. These results demonstrate IBO's potential to provide more reliable evaluations of XAI techniques, benefiting histopathology and other applications. The source code for this study is available at https://github.com/a-fsh-r/IBO.
comment: 19 pages, 6 figures
☆ Exploiting temporal information to detect conversational groups in videos and predict the next speaker
Studies in human human interaction have introduced the concept of F formation to describe the spatial arrangement of participants during social interactions. This paper has two objectives. It aims at detecting F formations in video sequences and predicting the next speaker in a group conversation. The proposed approach exploits time information and human multimodal signals in video sequences. In particular, we rely on measuring the engagement level of people as a feature of group belonging. Our approach makes use of a recursive neural network, the Long Short Term Memory (LSTM), to predict who will take the speaker's turn in a conversation group. Experiments on the MatchNMingle dataset led to 85% true positives in group detection and 98% accuracy in predicting the next speaker.
comment: Accepted to Pattern Recognition Letter, 8 pages, 10 figures
☆ Law of Vision Representation in MLLMs
We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
comment: The code is available at https://github.com/bronyayang/Law_of_Vision_Representation_in_MLLMs
☆ NeRF-CA: Dynamic Reconstruction of X-ray Coronary Angiography with Extremely Sparse-views
Dynamic three-dimensional (4D) reconstruction from two-dimensional X-ray coronary angiography (CA) remains a significant clinical problem. Challenges include sparse-view settings, intra-scan motion, and complex vessel morphology such as structure sparsity and background occlusion. Existing CA reconstruction methods often require extensive user interaction or large training datasets. On the other hand, Neural Radiance Field (NeRF), a promising deep learning technique, has successfully reconstructed high-fidelity static scenes for natural and medical scenes. Recent work, however, identified that sparse-views, background occlusion, and dynamics still pose a challenge when applying NeRF in the X-ray angiography context. Meanwhile, many successful works for natural scenes propose regularization for sparse-view reconstruction or scene decomposition to handle dynamics. However, these techniques do not directly translate to the CA context, where both challenges and background occlusion are significant. This paper introduces NeRF-CA, the first step toward a 4D CA reconstruction method that achieves reconstructions from sparse coronary angiograms with cardiac motion. We leverage the motion of the coronary artery to decouple the scene into a dynamic coronary artery component and static background. We combine this scene decomposition with tailored regularization techniques. These techniques enforce the separation of the coronary artery from the background by enforcing dynamic structure sparsity and scene smoothness. By uniquely combining these approaches, we achieve 4D reconstructions from as few as four angiogram sequences. This setting aligns with clinical workflows while outperforming state-of-the-art X-ray sparse-view NeRF reconstruction techniques. We validate our approach quantitatively and qualitatively using 4D phantom datasets and ablation studies.
☆ Toward Robust Early Detection of Alzheimer's Disease via an Integrated Multimodal Learning Approach
Alzheimer's Disease (AD) is a complex neurodegenerative disorder marked by memory loss, executive dysfunction, and personality changes. Early diagnosis is challenging due to subtle symptoms and varied presentations, often leading to misdiagnosis with traditional unimodal diagnostic methods due to their limited scope. This study introduces an advanced multimodal classification model that integrates clinical, cognitive, neuroimaging, and EEG data to enhance diagnostic accuracy. The model incorporates a feature tagger with a tabular data coding architecture and utilizes the TimesBlock module to capture intricate temporal patterns in Electroencephalograms (EEG) data. By employing Cross-modal Attention Aggregation module, the model effectively fuses Magnetic Resonance Imaging (MRI) spatial information with EEG temporal data, significantly improving the distinction between AD, Mild Cognitive Impairment, and Normal Cognition. Simultaneously, we have constructed the first AD classification dataset that includes three modalities: EEG, MRI, and tabular data. Our innovative approach aims to facilitate early diagnosis and intervention, potentially slowing the progression of AD. The source code and our private ADMC dataset are available at https://github.com/JustlfC03/MSTNet.
comment: 5 pages, 2 figures
☆ Learned Image Transmission with Hierarchical Variational Autoencoder
In this paper, we introduce an innovative hierarchical joint source-channel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, our proposed HJSCC provides enhanced adaptability by dynamically adjusting transmission bandwidth, encoding these representations into varying amounts of channel symbols. Additionally, we introduce a rate attention module to guide the JSCC encoder in optimizing its encoding strategy based on prior information. Extensive experiments on images of varying resolutions demonstrate that our proposed model outperforms existing baselines in rate-distortion performance and maintains robustness against channel noise.
☆ P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising ECCV 2024
In this work, we tackle the task of point cloud denoising through a novel framework that adapts Diffusion Schr\"odinger bridges to points clouds. Unlike previous approaches that predict point-wise displacements from point features or learned noise distributions, our method learns an optimal transport plan between paired point clouds. Experiments on object datasets like PU-Net and real-world datasets such as ScanNet++ and ARKitScenes show that P2P-Bridge achieves significant improvements over existing methods. While our approach demonstrates strong results using only point coordinates, we also show that incorporating additional features, such as color information or point-wise DINOv2 features, further enhances the performance. Code and pretrained models are available at https://p2p-bridge.github.io.
comment: ECCV 2024 Project page: https://p2p-bridge.github.io
☆ BEVal: A Cross-dataset Evaluation Study of BEV Segmentation Models for Autononomous Driving
Current research in semantic bird's-eye view segmentation for autonomous driving focuses solely on optimizing neural network models using a single dataset, typically nuScenes. This practice leads to the development of highly specialized models that may fail when faced with different environments or sensor setups, a problem known as domain shift. In this paper, we conduct a comprehensive cross-dataset evaluation of state-of-the-art BEV segmentation models to assess their performance across different training and testing datasets and setups, as well as different semantic categories. We investigate the influence of different sensors, such as cameras and LiDAR, on the models' ability to generalize to diverse conditions and scenarios. Additionally, we conduct multi-dataset training experiments that improve models' BEV segmentation performance compared to single-dataset training. Our work addresses the gap in evaluating BEV segmentation models under cross-dataset validation. And our findings underscore the importance of enhancing model generalizability and adaptability to ensure more robust and reliable BEV segmentation approaches for autonomous driving applications.
☆ ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding ACM MM 2024
Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distractions in an image, indicating an insufficient understanding of the fine-grained semantics and spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue. Firstly, we enhance the model's understanding of fine-grained semantics by injecting semantic prior information derived from text queries into the model. This is achieved by leveraging text-to-image generation models to produce images representing the semantic attributes of target objects described in queries. Secondly, we tackle the lack of training samples with multiple distractions by introducing a relation-sensitive data augmentation method. This method generates additional training data by synthesizing images containing multiple objects of the same category and pseudo queries based on their spatial relationships. The proposed ReSVG model significantly improves the model's ability to comprehend both object semantics and spatial relations, leading to enhanced performance in visual grounding tasks, particularly in scenarios with multiple-instance distractions. We conduct extensive experiments to validate the effectiveness of our methods on five datasets. Code is available at https://github.com/minghangz/ResVG.
comment: Accepted by ACM MM 2024
☆ FA-YOLO: Research On Efficient Feature Selection YOLO Improved Algorithm Based On FMDS and AGMF Modules
Over the past few years, the YOLO series of models has emerged as one of the dominant methodologies in the realm of object detection. Many studies have advanced these baseline models by modifying their architectures, enhancing data quality, and developing new loss functions. However, current models still exhibit deficiencies in processing feature maps, such as overlooking the fusion of cross-scale features and a static fusion approach that lacks the capability for dynamic feature adjustment. To address these issues, this paper introduces an efficient Fine-grained Multi-scale Dynamic Selection Module (FMDS Module), which applies a more effective dynamic feature selection and fusion method on fine-grained multi-scale feature maps, significantly enhancing the detection accuracy of small, medium, and large-sized targets in complex environments. Furthermore, this paper proposes an Adaptive Gated Multi-branch Focus Fusion Module (AGMF Module), which utilizes multiple parallel branches to perform complementary fusion of various features captured by the gated unit branch, FMDS Module branch, and TripletAttention branch. This approach further enhances the comprehensiveness, diversity, and integrity of feature fusion. This paper has integrated the FMDS Module, AGMF Module, into Yolov9 to develop a novel object detection model named FA-YOLO. Extensive experimental results show that under identical experimental conditions, FA-YOLO achieves an outstanding 66.1% mean Average Precision (mAP) on the PASCAL VOC 2007 dataset, representing 1.0% improvement over YOLOv9's 65.1%. Additionally, the detection accuracies of FA-YOLO for small, medium, and large targets are 44.1%, 54.6%, and 70.8%, respectively, showing improvements of 2.0%, 3.1%, and 0.9% compared to YOLOv9's 42.1%, 51.5%, and 69.9%.
comment: 11 pages and 4 figures
☆ Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning ECCV 2024
Foundation models have made incredible strides in achieving zero-shot or few-shot generalization, leveraging prompt engineering to mimic the problem-solving approach of human intelligence. However, when it comes to some foundation models like Segment Anything, there is still a challenge in performing well on out-of-distribution data, including camouflaged and medical images. Inconsistent prompting strategies during fine-tuning and testing further compound the issue, leading to decreased performance. Drawing inspiration from how human cognition processes new environments, we introduce SlotSAM, a method that reconstructs features from the encoder in a self-supervised manner to create object-centric representations. These representations are then integrated into the foundation model, bolstering its object-level perceptual capabilities while reducing the impact of distribution-related variables. The beauty of SlotSAM lies in its simplicity and adaptability to various tasks, making it a versatile solution that significantly enhances the generalization abilities of foundation models. Through limited parameter fine-tuning in a bootstrap manner, our approach paves the way for improved generalization in novel environments. The code is available at github.com/lytang63/SlotSAM.
comment: This work is accepted by ECCV 2024 EVAL-FoMo Workshop
☆ Semantics-Oriented Multitask Learning for DeepFake Detection: A Joint Embedding Approach
In recent years, the multimedia forensics and security community has seen remarkable progress in multitask learning for DeepFake (i.e., face forgery) detection. The prevailing strategy has been to frame DeepFake detection as a binary classification problem augmented by manipulation-oriented auxiliary tasks. This strategy focuses on learning features specific to face manipulations, which exhibit limited generalizability. In this paper, we delve deeper into semantics-oriented multitask learning for DeepFake detection, leveraging the relationships among face semantics via joint embedding. We first propose an automatic dataset expansion technique that broadens current face forgery datasets to support semantics-oriented DeepFake detection tasks at both the global face attribute and local face region levels. Furthermore, we resort to joint embedding of face images and their corresponding labels (depicted by textual descriptions) for prediction. This approach eliminates the need for manually setting task-agnostic and task-specific parameters typically required when predicting labels directly from images. In addition, we employ a bi-level optimization strategy to dynamically balance the fidelity loss weightings of various tasks, making the training process fully automated. Extensive experiments on six DeepFake datasets show that our method improves the generalizability of DeepFake detection and, meanwhile, renders some degree of model interpretation by providing human-understandable explanations.
☆ Enhanced Control for Diffusion Bridge in Image Restoration
Image restoration refers to the process of restoring a damaged low-quality image back to its corresponding high-quality image. Typically, we use convolutional neural networks to directly learn the mapping from low-quality images to high-quality images achieving image restoration. Recently, a special type of diffusion bridge model has achieved more advanced results in image restoration. It can transform the direct mapping from low-quality to high-quality images into a diffusion process, restoring low-quality images through a reverse process. However, the current diffusion bridge restoration models do not emphasize the idea of conditional control, which may affect performance. This paper introduces the ECDB model enhancing the control of the diffusion bridge with low-quality images as conditions. Moreover, in response to the characteristic of diffusion models having low denoising level at larger values of \(\bm t \), we also propose a Conditional Fusion Schedule, which more effectively handles the conditional feature information of various modules. Experimental results prove that the ECDB model has achieved state-of-the-art results in many image restoration tasks, including deraining, inpainting and super-resolution. Code is avaliable at https://github.com/Hammour-steak/ECDB.
☆ Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models ECCV 2024
In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling us to utilize efficient sparse retrieval algorithms employed in natural language processing for image retrieval tasks. To assist the LLM in extracting image features, we apply data augmentation techniques for key expansion and analyze the impact with a metric for relevance between images and textual data. We empirically show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods on the MS-COCO, PASCAL VOC, and NUS-WIDE datasets in a keyword-based image retrieval scenario, where keywords serve as search queries. We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
comment: Accepted to ECCV 2024 Workshops: 2nd Workshop on Traditional Computer Vision in the Age of Deep Learning (TradiCV)
☆ Convolutional Neural Network Compression Based on Low-Rank Decomposition
Deep neural networks typically impose significant computational loads and memory consumption. Moreover, the large parameters pose constraints on deploying the model on edge devices such as embedded systems. Tensor decomposition offers a clear advantage in compressing large-scale weight tensors. Nevertheless, direct utilization of low-rank decomposition typically leads to significant accuracy loss. This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization (VBMF) with orthogonal regularization. Initially, the model undergoes over-parameterization and training, with orthogonal regularization applied to enhance its likelihood of achieving the accuracy of the original model. Secondly, VBMF is employed to estimate the rank of the weight tensor at each layer. Our framework is sufficiently general to apply to other convolutional neural networks and easily adaptable to incorporate other tensor decomposition methods. Experimental results show that for both high and low compression ratios, our compression model exhibits advanced performance.
comment: 10 pages, 1 figures
☆ Fine-grained Classification of Port Wine Stains Using Optical Coherence Tomography Angiography
Accurate classification of port wine stains (PWS, vascular malformations present at birth), is critical for subsequent treatment planning. However, the current method of classifying PWS based on the external skin appearance rarely reflects the underlying angiopathological heterogeneity of PWS lesions, resulting in inconsistent outcomes with the common vascular-targeted photodynamic therapy (V-PDT) treatments. Conversely, optical coherence tomography angiography (OCTA) is an ideal tool for visualizing the vascular malformations of PWS. Previous studies have shown no significant correlation between OCTA quantitative metrics and the PWS subtypes determined by the current classification approach. This study proposes a new classification approach for PWS using both OCT and OCTA. By examining the hypodermic histopathology and vascular structure of PWS, we have devised a fine-grained classification method that subdivides PWS into five distinct types. To assess the angiopathological differences of various PWS subtypes, we have analyzed six metrics related to vascular morphology and depth information of PWS lesions. The five PWS types present significant differences across all metrics compared to the conventional subtypes. Our findings suggest that an angiopathology-based classification accurately reflects the heterogeneity in PWS lesions. This research marks the first attempt to classify PWS based on angiopathology, potentially guiding more effective subtyping and treatment strategies for PWS.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ SAU: A Dual-Branch Network to Enhance Long-Tailed Recognition via Generative Models
Long-tailed distributions in image recognition pose a considerable challenge due to the severe imbalance between a few dominant classes with numerous examples and many minority classes with few samples. Recently, the use of large generative models to create synthetic data for image classification has been realized, but utilizing synthetic data to address the challenge of long-tailed recognition remains relatively unexplored. In this work, we proposed the use of synthetic data as a complement to long-tailed datasets to eliminate the impact of data imbalance. To tackle this real-synthetic mixed dataset, we designed a two-branch model that contains Synthetic-Aware and Unaware branches (SAU). The core ideas are (1) a synthetic-unaware branch for classification that mixes real and synthetic data and treats all data equally without distinguishing between them. (2) A synthetic-aware branch for improving the robustness of the feature extractor by distinguishing between real and synthetic data and learning their discrepancies. Extensive experimental results demonstrate that our method can improve the accuracy of long-tailed image recognition. Notably, our approach achieves state-of-the-art Top-1 accuracy and significantly surpasses other methods on CIFAR-10-LT and CIFAR-100-LT datasets across various imbalance factors. Our code is available at https://github.com/lgX1123/gm4lt.
comment: 15 pages
☆ Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding
Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos. This leads to unreliable predictions for noisy, corrupted, and out-of-distribution data. Adapting VTG models to dynamically estimate uncertainties based on user input can address this issue. To this end, we introduce SRAM, a robust network module that benefits from a two-stage cross-modal alignment task. More importantly, it integrates Deep Evidential Regression (DER) to explicitly and thoroughly quantify uncertainty during training, thus allowing the model to say "I do not know" in scenarios beyond its handling capacity. However, the direct application of traditional DER theory and its regularizer reveals structural flaws, leading to unintended constraints in VTG tasks. In response, we develop a simple yet effective Geom-regularizer that enhances the uncertainty learning framework from the ground up. To the best of our knowledge, this marks the first successful attempt of DER in VTG. Our extensive quantitative and qualitative results affirm the effectiveness, robustness, and interpretability of our modules and the uncertainty learning paradigm in VTG tasks. The code will be made available.
comment: Ongoing work: 28pages, 19 figures, 7 tables. Code is available at: https://kaijing.space/SRAM/
☆ UDD: Dataset Distillation via Mining Underutilized Regions
Dataset distillation synthesizes a small dataset such that a model trained on this set approximates the performance of the original dataset. Recent studies on dataset distillation focused primarily on the design of the optimization process, with methods such as gradient matching, feature alignment, and training trajectory matching. However, little attention has been given to the issue of underutilized regions in synthetic images. In this paper, we propose UDD, a novel approach to identify and exploit the underutilized regions to make them informative and discriminate, and thus improve the utilization of the synthetic dataset. Technically, UDD involves two underutilized regions searching policies for different conditions, i.e., response-based policy and data jittering-based policy. Compared with previous works, such two policies are utilization-sensitive, equipping with the ability to dynamically adjust the underutilized regions during the training process. Additionally, we analyze the current model optimization problem and design a category-wise feature contrastive loss, which can enhance the distinguishability of different categories and alleviate the shortcomings of the existing multi-formation methods. Experimentally, our method improves the utilization of the synthetic dataset and outperforms the state-of-the-art methods on various datasets, such as MNIST, FashionMNIST, SVHN, CIFAR-10, and CIFAR-100. For example, the improvements on CIFAR-10 and CIFAR-100 are 4.0\% and 3.7\% over the next best method with IPC=1, by mining the underutilized regions.
comment: PRCV2024
☆ Improving Diffusion-based Data Augmentation with Inversion Spherical Interpolation
Data Augmentation (DA), \ie, synthesizing faithful and diverse samples to expand the original training set, is a prevalent and effective strategy to improve various visual recognition tasks. With the powerful image generation ability, diffusion-based DA has shown strong performance gains on different benchmarks. In this paper, we analyze today's diffusion-based DA methods, and argue that they cannot take account of both faithfulness and diversity, which are two critical keys for generating high-quality samples and boosting final classification performance. To this end, we propose a novel Diffusion-based Inversion Interpolation DA method: Diff-II. Specifically, Diff-II consists of three main steps: 1) Category concepts learning: Learning concept embeddings for each category. 2) Inversion interpolation: Calculating the inversion for each image, and conducting spherical interpolation for two randomly sampled inversions from the same category. 3) Two-stage denoising: Using different prompts to generate synthesized images in a coarse-to-fine manner. Extensive experiments on multiple image classification tasks (\eg, few-shot, long-tailed, and out-of-distribution classification) have demonstrated its effectiveness over state-of-the-art diffusion-based DA methods.
☆ Low Saturation Confidence Distribution-based Test-Time Adaptation for Cross-Domain Remote Sensing Image Classification
Although the Unsupervised Domain Adaptation (UDA) method has improved the effect of remote sensing image classification tasks, most of them are still limited by access to the source domain (SD) data. Designs such as Source-free Domain Adaptation (SFDA) solve the challenge of a lack of SD data, however, they still rely on a large amount of target domain data and thus cannot achieve fast adaptations, which seriously hinders their further application in broader scenarios. The real-world applications of cross-domain remote sensing image classification require a balance of speed and accuracy at the same time. Therefore, we propose a novel and comprehensive test time adaptation (TTA) method -- Low Saturation Confidence Distribution Test Time Adaptation (LSCD-TTA), which is the first attempt to solve such scenarios through the idea of TTA. LSCD-TTA specifically considers the distribution characteristics of remote sensing images, including three main parts that concentrate on different optimization directions: First, low saturation distribution (LSD) considers the dominance of low-confidence samples during the later TTA stage. Second, weak-category cross-entropy (WCCE) increases the weight of categories that are more difficult to classify with less prior knowledge. Finally, diverse categories confidence (DIV) comprehensively considers the category diversity to alleviate the deviation of the sample distribution. By weighting the abovementioned three modules, the model can widely, quickly and accurately adapt to the target domain without much prior target distributions, repeated data access, and manual annotation. We evaluate LSCD-TTA on three remote-sensing image datasets. The experimental results show that LSCD-TTA achieves a significant gain of 4.96%-10.51% with Resnet-50 and 5.33%-12.49% with Resnet-101 in average accuracy compared to other state-of-the-art DA and TTA methods.
☆ Advancing Architectural Floorplan Design with Geometry-enhanced Graph Diffusion
Automating architectural floorplan design is vital for housing and interior design, offering a faster, cost-effective alternative to manual sketches by architects. However, existing methods, including rule-based and learning-based approaches, face challenges in design complexity and constrained generation with extensive post-processing, and tend to obvious geometric inconsistencies such as misalignment, overlap, and gaps. In this work, we propose a novel generative framework for vector floorplan design via structural graph generation, called GSDiff, focusing on wall junction generation and wall segment prediction to capture both geometric and semantic aspects of structural graphs. To improve the geometric rationality of generated structural graphs, we propose two innovative geometry enhancement methods. In wall junction generation, we propose a novel alignment loss function to improve geometric consistency. In wall segment prediction, we propose a random self-supervision method to enhance the model's perception of the overall geometric structure, thereby promoting the generation of reasonable geometric structures. Employing the diffusion model and the Transformer model, as well as the geometry enhancement strategies, our framework can generate wall junctions, wall segments and room polygons with structural and semantic information, resulting in structural graphs that accurately represent floorplans. Extensive experiments show that the proposed method surpasses existing techniques, enabling free generation and constrained generation, marking a shift towards structure generation in architectural design.
☆ EvLight++: Low-Light Video Enhancement with an Event Camera: A Large-Scale Real-World Dataset, Novel Method, and More
Event cameras offer significant advantages for low-light video enhancement, primarily due to their high dynamic range. Current research, however, is severely limited by the absence of large-scale, real-world, and spatio-temporally aligned event-video datasets. To address this, we introduce a large-scale dataset with over 30,000 pairs of frames and events captured under varying illumination. This dataset was curated using a robotic arm that traces a consistent non-linear trajectory, achieving spatial alignment precision under 0.03mm and temporal alignment with errors under 0.01s for 90% of the dataset. Based on the dataset, we propose \textbf{EvLight++}, a novel event-guided low-light video enhancement approach designed for robust performance in real-world scenarios. Firstly, we design a multi-scale holistic fusion branch to integrate structural and textural information from both images and events. To counteract variations in regional illumination and noise, we introduce Signal-to-Noise Ratio (SNR)-guided regional feature selection, enhancing features from high SNR regions and augmenting those from low SNR regions by extracting structural information from events. To incorporate temporal information and ensure temporal coherence, we further introduce a recurrent module and temporal loss in the whole pipeline. Extensive experiments on our and the synthetic SDSD dataset demonstrate that EvLight++ significantly outperforms both single image- and video-based methods by 1.37 dB and 3.71 dB, respectively. To further explore its potential in downstream tasks like semantic segmentation and monocular depth estimation, we extend our datasets by adding pseudo segmentation and depth labels via meticulous annotation efforts with foundation models. Experiments under diverse low-light scenes show that the enhanced results achieve a 15.97% improvement in mIoU for semantic segmentation.
comment: Journal extension based on EvLight (arXiv:2404.00834)
☆ Anno-incomplete Multi-dataset Detection
Object detectors have shown outstanding performance on various public datasets. However, annotating a new dataset for a new task is usually unavoidable in real, since 1) a single existing dataset usually does not contain all object categories needed; 2) using multiple datasets usually suffers from annotation incompletion and heterogeneous features. We propose a novel problem as "Annotation-incomplete Multi-dataset Detection", and develop an end-to-end multi-task learning architecture which can accurately detect all the object categories with multiple partially annotated datasets. Specifically, we propose an attention feature extractor which helps to mine the relations among different datasets. Besides, a knowledge amalgamation training strategy is incorporated to accommodate heterogeneous features from different sources. Extensive experiments on different object detection datasets demonstrate the effectiveness of our methods and an improvement of 2.17%, 2.10% in mAP can be achieved on COCO and VOC respectively.
comment: 12 pages, 9 figures
☆ Neural Spectral Decomposition for Dataset Distillation ECCV 2024
In this paper, we propose Neural Spectrum Decomposition, a generic decomposition framework for dataset distillation. Unlike previous methods, we consider the entire dataset as a high-dimensional observation that is low-rank across all dimensions. We aim to discover the low-rank representation of the entire dataset and perform distillation efficiently. Toward this end, we learn a set of spectrum tensors and transformation matrices, which, through simple matrix multiplication, reconstruct the data distribution. Specifically, a spectrum tensor can be mapped back to the image space by a transformation matrix, and efficient information sharing during the distillation learning process is achieved through pairwise combinations of different spectrum vectors and transformation matrices. Furthermore, we integrate a trajectory matching optimization method guided by a real distribution. Our experimental results demonstrate that our approach achieves state-of-the-art performance on benchmarks, including CIFAR10, CIFAR100, Tiny Imagenet, and ImageNet Subset. Our code are available at \url{https://github.com/slyang2021/NSD}.
comment: ECCV 2024
☆ LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement
While recent low-light image enhancement (LLIE) methods have made significant advancements, they still face challenges in terms of low visual quality and weak generalization ability when applied to complex scenarios. To address these issues, we propose a semi-supervised method based on latent mean-teacher and Gaussian process, named LMT-GP. We first design a latent mean-teacher framework that integrates both labeled and unlabeled data, as well as their latent vectors, into model training. Meanwhile, we use a mean-teacher-assisted Gaussian process learning strategy to establish a connection between the latent and pseudo-latent vectors obtained from the labeled and unlabeled data. To guide the learning process, we utilize an assisted Gaussian process regression (GPR) loss function. Furthermore, we design a pseudo-label adaptation module (PAM) to ensure the reliability of the network learning. To demonstrate our method's generalization ability and effectiveness, we apply it to multiple LLIE datasets and high-level vision tasks. Experiment results demonstrate that our method achieves high generalization performance and image quality. The code is available at https://github.com/HFUT-CV/LMT-GP.
☆ PSE-Net: Channel Pruning for Convolutional Neural Networks with Parallel-subnets Estimator
Channel Pruning is one of the most widespread techniques used to compress deep neural networks while maintaining their performances. Currently, a typical pruning algorithm leverages neural architecture search to directly find networks with a configurable width, the key step of which is to identify representative subnet for various pruning ratios by training a supernet. However, current methods mainly follow a serial training strategy to optimize supernet, which is very time-consuming. In this work, we introduce PSE-Net, a novel parallel-subnets estimator for efficient channel pruning. Specifically, we propose a parallel-subnets training algorithm that simulate the forward-backward pass of multiple subnets by droping extraneous features on batch dimension, thus various subnets could be trained in one round. Our proposed algorithm facilitates the efficiency of supernet training and equips the network with the ability to interpolate the accuracy of unsampled subnets, enabling PSE-Net to effectively evaluate and rank the subnets. Over the trained supernet, we develop a prior-distributed-based sampling algorithm to boost the performance of classical evolutionary search. Such algorithm utilizes the prior information of supernet training phase to assist in the search of optimal subnets while tackling the challenge of discovering samples that satisfy resource constraints due to the long-tail distribution of network configuration. Extensive experiments demonstrate PSE-Net outperforms previous state-of-the-art channel pruning methods on the ImageNet dataset while retaining superior supernet training efficiency. For example, under 300M FLOPs constraint, our pruned MobileNetV2 achieves 75.2% Top-1 accuracy on ImageNet dataset, exceeding the original MobileNetV2 by 2.6 units while only cost 30%/16% times than BCNet/AutoAlim.
comment: 10pages, Neural Networks
☆ Enhancing Conditional Image Generation with Explainable Latent Space Manipulation
In the realm of image synthesis, achieving fidelity to a reference image while adhering to conditional prompts remains a significant challenge. This paper proposes a novel approach that integrates a diffusion model with latent space manipulation and gradient-based selective attention mechanisms to address this issue. Leveraging Grad-SAM (Gradient-based Selective Attention Manipulation), we analyze the cross attention maps of the cross attention layers and gradients for the denoised latent vector, deriving importance scores of elements of denoised latent vector related to the subject of interest. Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features. This approach ensures the faithful formation of subjects based on conditional prompts, while concurrently refining the background for a more coherent composition. Our experiments on places365 dataset demonstrate promising results, with our proposed model achieving the lowest mean and median Frechet Inception Distance (FID) scores compared to baseline models, indicating superior fidelity preservation. Furthermore, our model exhibits competitive performance in aligning the generated images with provided textual descriptions, as evidenced by high CLIP scores. These results highlight the effectiveness of our approach in both fidelity preservation and textual context preservation, offering a significant advancement in text-to-image synthesis tasks.
comment: 7 pages , 5 figures
☆ Revisiting 360 Depth Estimation with PanoGabor: A New Fusion Perspective
Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations (\textit{e.g.}, Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces the troublesome distortions. In this work, we propose an oriented distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, thereby extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a linear latitude-aware distortion representation method to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-free features. Considering the orientation sensitivity of the Gabor transform, we introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions. Code can be available upon acceptance.
☆ LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding. Code and data would be available.
☆ Training-free Video Temporal Grounding using Large-scale Pre-trained Models ECCV 2024
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on specific datasets for training and have high data collection costs, but they exhibit poor generalization capability under the across-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the ability of pre-trained large models. A naive baseline is to enumerate proposals in the video and use the pre-trained visual language models (VLMs) to select the best proposal according to the vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, making it struggle to (1) grasp the relationship and distinguish the temporal boundaries of multiple events within the same video; (2) comprehend and be sensitive to the dynamic transition of events (the transition from one event to another) in the video. To address these issues, we propose leveraging large language models (LLMs) to analyze multiple sub-events contained in the query text and analyze the temporal order and relationships between these events. Secondly, we split a sub-event into dynamic transition and static status parts and propose the dynamic and static scoring functions using VLMs to better evaluate the relevance between the event and the description. Finally, for each sub-event description, we use VLMs to locate the top-k proposals and leverage the order and relationships between sub-events provided by LLMs to filter and integrate these proposals. Our method achieves the best performance on zero-shot video temporal grounding on Charades-STA and ActivityNet Captions datasets without any training and demonstrates better generalization capabilities in cross-dataset and OOD settings.
comment: Accepted by ECCV 2024
☆ M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation
The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the multi-tasking capabilities of LLMs or lacking clinical accuracy. This paper presents M4CXR, a multi-modal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multi-image, and multi-study contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and also demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR's versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.
☆ Uni-3DAD: GAN-Inversion Aided Universal 3D Anomaly Detection on Model-free Products
Anomaly detection is a long-standing challenge in manufacturing systems. Traditionally, anomaly detection has relied on human inspectors. However, 3D point clouds have gained attention due to their robustness to environmental factors and their ability to represent geometric data. Existing 3D anomaly detection methods generally fall into two categories. One compares scanned 3D point clouds with design files, assuming these files are always available. However, such assumptions are often violated in many real-world applications where model-free products exist, such as fresh produce (i.e., ``Cookie", ``Potato", etc.), dentures, bone, etc. The other category compares patches of scanned 3D point clouds with a library of normal patches named memory bank. However, those methods usually fail to detect incomplete shapes, which is a fairly common defect type (i.e., missing pieces of different products). The main challenge is that missing areas in 3D point clouds represent the absence of scanned points. This makes it infeasible to compare the missing region with existing point cloud patches in the memory bank. To address these two challenges, we proposed a unified, unsupervised 3D anomaly detection framework capable of identifying all types of defects on model-free products. Our method integrates two detection modules: a feature-based detection module and a reconstruction-based detection module. Feature-based detection covers geometric defects, such as dents, holes, and cracks, while the reconstruction-based method detects missing regions. Additionally, we employ a One-class Support Vector Machine (OCSVM) to fuse the detection results from both modules. The results demonstrate that (1) our proposed method outperforms the state-of-the-art methods in identifying incomplete shapes and (2) it still maintains comparable performance with the SOTA methods in detecting all other types of anomalies.
☆ PolarBEVDet: Exploring Polar Representation for Multi-View 3D Object Detection in Bird's-Eye-View
Recently, LSS-based multi-view 3D object detection provides an economical and deployment-friendly solution for autonomous driving. However, all the existing LSS-based methods transform multi-view image features into a Cartesian Bird's-Eye-View(BEV) representation, which does not take into account the non-uniform image information distribution and hardly exploits the view symmetry. In this paper, in order to adapt the image information distribution and preserve the view symmetry by regular convolution, we propose to employ the polar BEV representation to substitute the Cartesian BEV representation. To achieve this, we elaborately tailor three modules: a polar view transformer to generate the polar BEV representation, a polar temporal fusion module for fusing historical polar BEV features and a polar detection head to predict the polar-parameterized representation of the object. In addition, we design a 2D auxiliary detection head and a spatial attention enhancement module to improve the quality of feature extraction in perspective view and BEV, respectively. Finally, we integrate the above improvements into a novel multi-view 3D object detector, PolarBEVDet. Experiments on nuScenes show that PolarBEVDet achieves the superior performance. The code is available at https://github.com/Yzichen/PolarBEVDet.git.
comment: 11 pages, 6 figures
☆ DLM-VMTL:A Double Layer Mapper for heterogeneous data video Multi-task prompt learning
In recent years, the parameters of backbones of Video Understanding tasks continue to increase and even reach billion-level. Whether fine-tuning a specific task on the Video Foundation Model or pre-training the model designed for the specific task, incurs a lot of overhead. How to make these models play other values than their own tasks becomes a worthy question. Multi-Task Learning(MTL) makes the visual task acquire the rich shareable knowledge from other tasks while joint training. It is fully explored in Image Recognition tasks especially dense predict tasks. Nevertheless, it is rarely used in video domain due to the lack of multi-labels video data. In this paper, a heterogenous data video multi-task prompt learning (VMTL) method is proposed to address above problem. It's different from it in image domain, a Double-Layers Mapper(DLM) is proposed to extract the shareable knowledge into visual promptS and align it with representation of primary task. Extensive experiments prove that our DLM-VMTL performs better than baselines on 6 different video understanding tasks and 11 datasets.
☆ Estimating Dynamic Flow Features in Groups of Tracked Objects
Interpreting motion captured in image sequences is crucial for a wide range of computer vision applications. Typical estimation approaches include optical flow (OF), which approximates the apparent motion instantaneously in a scene, and multiple object tracking (MOT), which tracks the motion of subjects over time. Often, the motion of objects in a scene is governed by some underlying dynamical system which could be inferred by analyzing the motion of groups of objects. Standard motion analyses, however, are not designed to intuit flow dynamics from trajectory data, making such measurements difficult in practice. The goal of this work is to extend gradient-based dynamical systems analyses to real-world applications characterized by complex, feature-rich image sequences with imperfect tracers. The tracer trajectories are tracked using deep vision networks and gradients are approximated using Lagrangian gradient regression (LGR), a tool designed to estimate spatial gradients from sparse data. From gradients, dynamical features such as regions of coherent rotation and transport barriers are identified. The proposed approach is affordably implemented and enables advanced studies including the motion analysis of two distinct object classes in a single image sequence. Two examples of the method are presented on data sets for which standard gradient-based analyses do not apply.
comment: 21 pages, 6 figures
☆ VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition
For visual recognition, knowledge distillation typically involves transferring knowledge from a large, well-trained teacher model to a smaller student model. In this paper, we introduce an effective method to distill knowledge from an off-the-shelf vision-language model (VLM), demonstrating that it provides novel supervision in addition to those from a conventional vision-only teacher model. Our key technical contribution is the development of a framework that generates novel text supervision and distills free-form text into a vision encoder. We showcase the effectiveness of our approach, termed VLM-KD, across various benchmark datasets, showing that it surpasses several state-of-the-art long-tail visual classifiers. To our knowledge, this work is the first to utilize knowledge distillation with text supervision generated by an off-the-shelf VLM and apply it to vanilla randomly initialized vision encoders.
☆ Enhancing Autism Spectrum Disorder Early Detection with the Parent-Child Dyads Block-Play Protocol and an Attention-enhanced GCN-xLSTM Hybrid Deep Learning Framework
Autism Spectrum Disorder (ASD) is a rapidly growing neurodevelopmental disorder. Performing a timely intervention is crucial for the growth of young children with ASD, but traditional clinical screening methods lack objectivity. This study introduces an innovative approach to early detection of ASD. The contributions are threefold. First, this work proposes a novel Parent-Child Dyads Block-Play (PCB) protocol, grounded in kinesiological and neuroscientific research, to identify behavioral patterns distinguishing ASD from typically developing (TD) toddlers. Second, we have compiled a substantial video dataset, featuring 40 ASD and 89 TD toddlers engaged in block play with parents. This dataset exceeds previous efforts on both the scale of participants and the length of individual sessions. Third, our approach to action analysis in videos employs a hybrid deep learning framework, integrating a two-stream graph convolution network with attention-enhanced xLSTM (2sGCN-AxLSTM). This framework is adept at capturing dynamic interactions between toddlers and parents by extracting spatial features correlated with upper body and head movements and focusing on global contextual information of action sequences over time. By learning these global features with spatio-temporal correlations, our 2sGCN-AxLSTM effectively analyzes dynamic human behavior patterns and demonstrates an unprecedented accuracy of 89.6\% in early detection of ASD. Our approach shows strong potential for enhancing early ASD diagnosis by accurately analyzing parent-child interactions, providing a critical tool to support timely and informed clinical decision-making.
comment: 18 pages, 8 figures, and 4 tables
☆ Ig3D: Integrating 3D Face Representations in Facial Expression Inference ECCV
Reconstructing 3D faces with facial geometry from single images has allowed for major advances in animation, generative models, and virtual reality. However, this ability to represent faces with their 3D features is not as fully explored by the facial expression inference (FEI) community. This study therefore aims to investigate the impacts of integrating such 3D representations into the FEI task, specifically for facial expression classification and face-based valence-arousal (VA) estimation. To accomplish this, we first assess the performance of two 3D face representations (both based on the 3D morphable model, FLAME) for the FEI tasks. We further explore two fusion architectures, intermediate fusion and late fusion, for integrating the 3D face representations with existing 2D inference frameworks. To evaluate our proposed architecture, we extract the corresponding 3D representations and perform extensive tests on the AffectNet and RAF-DB datasets. Our experimental results demonstrate that our proposed method outperforms the state-of-the-art AffectNet VA estimation and RAF-DB classification tasks. Moreover, our method can act as a complement to other existing methods to boost performance in many emotion inference tasks.
comment: Accepted by ECCVW 2024
☆ Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector
Deepfakes, which employ GAN to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional CNN have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments also conducted on FF++, DFDCPreview, and Celeb-DF dataset underwent several post-processing situations, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many post-processing procedures.
☆ LV-UNet: A Lightweight and Vanilla Model for Medical Image Segmentation
Although the progress made by large models in computer vision, optimization challenges, the complexity of transformer models, computational limitations, and the requirements of practical applications call for simpler designs in model architecture for medical image segmentation, especially in mobile medical devices that require lightweight and deployable models with real-time performance. However, some of the current lightweight models exhibit poor robustness across different datasets, which hinders their broader adoption. This paper proposes a lightweight and vanilla model called LV-UNet, which effectively utilizes pre-trained MobileNetv3-Large models and introduces fusible modules. It can be trained using an improved deep training strategy and switched to deployment mode during inference, reducing both parameter count and computational load. Experiments are conducted on ISIC 2016, BUSI, CVC- ClinicDB, CVC-ColonDB, and Kvair-SEG datasets, achieving better performance compared to the state-of-the-art and classic models.
☆ Revising Multimodal VAEs with Diffusion Decoders
Multimodal VAEs often struggle with generating high-quality outputs, a challenge that extends beyond the inherent limitations of the VAE framework. The core issue lies in the restricted joint representation of the latent space, particularly when complex modalities like images are involved. Feedforward decoders, commonly used for these intricate modalities, inadvertently constrain the joint latent space, leading to a degradation in the quality of the other modalities as well. Although recent studies have shown improvement by introducing modality-specific representations, the issue remains significant. In this work, we demonstrate that incorporating a flexible diffusion decoder specifically for the image modality not only enhances the generation quality of the images but also positively impacts the performance of the other modalities that rely on feedforward decoders. This approach addresses the limitations imposed by conventional joint representations and opens up new possibilities for improving multimodal generation tasks using the multimodal VAE framework. Our model provides state-of-the-art results compared to other multimodal VAEs in different datasets with higher coherence and superior quality in the generated modalities
☆ FineFACE: Fair Facial Attribute Classification Leveraging Fine-grained Features
Published research highlights the presence of demographic bias in automated facial attribute classification algorithms, particularly impacting women and individuals with darker skin tones. Existing bias mitigation techniques typically require demographic annotations and often obtain a trade-off between fairness and accuracy, i.e., Pareto inefficiency. Facial attributes, whether common ones like gender or others such as "chubby" or "high cheekbones", exhibit high interclass similarity and intraclass variation across demographics leading to unequal accuracy. This requires the use of local and subtle cues using fine-grained analysis for differentiation. This paper proposes a novel approach to fair facial attribute classification by framing it as a fine-grained classification problem. Our approach effectively integrates both low-level local features (like edges and color) and high-level semantic features (like shapes and structures) through cross-layer mutual attention learning. Here, shallow to deep CNN layers function as experts, offering category predictions and attention regions. An exhaustive evaluation on facial attribute annotated datasets demonstrates that our FineFACE model improves accuracy by 1.32% to 1.74% and fairness by 67% to 83.6%, over the SOTA bias mitigation techniques. Importantly, our approach obtains a Pareto-efficient balance between accuracy and fairness between demographic groups. In addition, our approach does not require demographic annotations and is applicable to diverse downstream classification tasks. To facilitate reproducibility, the code and dataset information is available at https://github.com/VCBSL-Fairness/FineFACE.
☆ MSLIQA: Enhancing Learning Representations for Image Quality Assessment through Multi-Scale Learning
No-Reference Image Quality Assessment (NR-IQA) remains a challenging task due to the diversity of distortions and the lack of large annotated datasets. Many studies have attempted to tackle these challenges by developing more accurate NR-IQA models, often employing complex and computationally expensive networks, or by bridging the domain gap between various distortions to enhance performance on test datasets. In our work, we improve the performance of a generic lightweight NR-IQA model by introducing a novel augmentation strategy that boosts its performance by almost 28\%. This augmentation strategy enables the network to better discriminate between different distortions in various parts of the image by zooming in and out. Additionally, the inclusion of test-time augmentation further enhances performance, making our lightweight network's results comparable to the current state-of-the-art models, simply through the use of augmentations.
☆ GameIR: A Large-Scale Synthesized Ground-Truth Dataset for Image Restoration over Gaming Content
Image restoration methods like super-resolution and image synthesis have been successfully used in commercial cloud gaming products like NVIDIA's DLSS. However, restoration over gaming content is not well studied by the general public. The discrepancy is mainly caused by the lack of ground-truth gaming training data that match the test cases. Due to the unique characteristics of gaming content, the common approach of generating pseudo training data by degrading the original HR images results in inferior restoration performance. In this work, we develop GameIR, a large-scale high-quality computer-synthesized ground-truth dataset to fill in the blanks, targeting at two different applications. The first is super-resolution with deferred rendering, to support the gaming solution of rendering and transferring LR images only and restoring HR images on the client side. We provide 19200 LR-HR paired ground-truth frames coming from 640 videos rendered at 720p and 1440p for this task. The second is novel view synthesis (NVS), to support the multiview gaming solution of rendering and transferring part of the multiview frames and generating the remaining frames on the client side. This task has 57,600 HR frames from 960 videos of 160 scenes with 6 camera views. In addition to the RGB frames, the GBuffers during the deferred rendering stage are also provided, which can be used to help restoration. Furthermore, we evaluate several SOTA super-resolution algorithms and NeRF-based NVS algorithms over our dataset, which demonstrates the effectiveness of our ground-truth GameIR data in improving restoration performance for gaming content. Also, we test the method of incorporating the GBuffers as additional input information for helping super-resolution and NVS. We release our dataset and models to the general public to facilitate research on restoration methods over gaming content.
☆ Comparative Analysis of Transfer Learning Models for Breast Cancer Classification
The classification of histopathological images is crucial for the early and precise detection of breast cancer. This study investigates the efficiency of deep learning models in distinguishing between Invasive Ductal Carcinoma (IDC) and non-IDC in histopathology slides. We conducted a thorough comparison examination of eight sophisticated models: ResNet-50, DenseNet-121, ResNeXt-50, Vision Transformer (ViT), GoogLeNet (Inception v3), EfficientNet, MobileNet, and SqueezeNet. This analysis was carried out using a large dataset of 277,524 image patches. Our research makes a substantial contribution to the field by offering a comprehensive assessment of the performance of each model. We particularly highlight the exceptional efficacy of attention-based mechanisms in the ViT model, which achieved a remarkable validation accuracy of 93\%, surpassing conventional convolutional networks. This study highlights the promise of advanced machine learning approaches in clinical settings, offering improved precision as well as efficiency in breast cancer diagnosis.
☆ Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis
Recent advances in Diffusion Models (DMs) have led to significant progress in visual synthesis and editing tasks, establishing them as a strong competitor to Generative Adversarial Networks (GANs). However, the latent space of DMs is not as well understood as that of GANs. Recent research has focused on unsupervised semantic discovery in the latent space of DMs by leveraging the bottleneck layer of the denoising network, which has been shown to exhibit properties of a semantic latent space. However, these approaches are limited to discovering global attributes. In this paper we address, the challenge of local image manipulation in DMs and introduce an unsupervised method to factorize the latent semantics learned by the denoising network of pre-trained DMs. Given an arbitrary image and defined regions of interest, we utilize the Jacobian of the denoising network to establish a relation between the regions of interest and their corresponding subspaces in the latent space. Furthermore, we disentangle the joint and individual components of these subspaces to identify latent directions that enable local image manipulation. Once discovered, these directions can be applied to different images to produce semantically consistent edits, making our method suitable for practical applications. Experimental results on various datasets demonstrate that our method can produce semantic edits that are more localized and have better fidelity compared to the state-of-the-art.
comment: Code available here: https://zelaki.github.io/localdiff/
☆ Fluent and Accurate Image Captioning with a Self-Trained Reward Model ICPR 2024
Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.
comment: ICPR 2024
☆ See or Guess: Counterfactually Regularized Image Captioning ACM MM 2024
Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe those where certain parts of the image are obscured or edited, unlike humans who excel in such cases. These weaknesses they exhibit, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at https://github.com/Aman-4-Real/See-or-Guess.
comment: Accepted by ACM MM 2024
☆ STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models
The rapid proliferation of large-scale text-to-image generation (T2IG) models has led to concerns about their potential misuse in generating harmful content. Though many methods have been proposed for erasing undesired concepts from T2IG models, they only provide a false sense of security, as recent works demonstrate that concept-erased models (CEMs) can be easily deceived to generate the erased concept through adversarial attacks. The problem of adversarially robust concept erasing without significant degradation to model utility (ability to generate benign concepts) remains an unresolved challenge, especially in the white-box setting where the adversary has access to the CEM. To address this gap, we propose an approach called STEREO that involves two distinct stages. The first stage searches thoroughly enough for strong and diverse adversarial prompts that can regenerate an erased concept from a CEM, by leveraging robust optimization principles from adversarial training. In the second robustly erase once stage, we introduce an anchor-concept-based compositional objective to robustly erase the target concept at one go, while attempting to minimize the degradation on model utility. By benchmarking the proposed STEREO approach against four state-of-the-art concept erasure methods under three adversarial attacks, we demonstrate its ability to achieve a better robustness vs. utility trade-off. Our code and models are available at https://github.com/koushiksrivats/robust-concept-erasing.
comment: Project Page: https://koushiksrivats.github.io/robust-concept-erasing/
♻ ☆ VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons, sketches and scientific figures. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs and (f) comparison with VLMs on rasterized representations. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced at https://vgbench.github.io.
comment: Project Page: https://vgbench.github.io
♻ ☆ GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.
comment: V2: Fixed references formatting
♻ ☆ Evaluation Framework for Feedback Generation Methods in Skeletal Movement Assessment ECCV 2024
The application of machine-learning solutions to movement assessment from skeleton videos has attracted significant research attention in recent years. This advancement has made rehabilitation at home more accessible, utilizing movement assessment algorithms that can operate on affordable equipment for human pose detection and analysis from 2D or 3D videos. While the primary objective of automatic assessment tasks is to score movements, the automatic generation of feedback highlighting key movement issues has the potential to significantly enhance and accelerate the rehabilitation process. While numerous research works exist in the field of automatic movement assessment, only a handful address feedback generation. In this study, we propose terminology and criteria for the classification, evaluation, and comparison of feedback generation solutions. We discuss the challenges associated with each feedback generation approach and use our proposed criteria to classify existing solutions. To our knowledge, this is the first work that formulates feedback generation in skeletal movement assessment.
comment: Accepted to xAI4Biometrics 2024 at ECCV 2024
♻ ☆ OpticalRS-4M: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset
Masked Image Modeling (MIM) has become an essential method for building foundational visual models in remote sensing (RS). However, the limitations in size and diversity of existing RS datasets restrict the ability of MIM methods to learn generalizable representations. Additionally, conventional MIM techniques, which require reconstructing all tokens, introduce unnecessary computational overhead. To address these issues, we present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient MIM approach. We curated a high-quality dataset named OpticalRS-4M by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication. OpticalRS-4M comprises 4 million optical images covering various RS tasks, such as object detection and pixel segmentation. To enhance efficiency, we propose SelectiveMAE, a pre-training method that dynamically encodes and reconstructs semantically rich patch tokens, thereby reducing the inefficiencies of traditional MIM models caused by redundant background pixels in RS images. Extensive experiments demonstrate that OpticalRS-4M significantly improves classification, detection, and segmentation performance, while SelectiveMAE increases training efficiency over 2 times. This highlights the effectiveness and scalability of our pipeline in developing RS foundational models.
♻ ☆ Manipulate-Anything: Automating Real-World Robots using Vision-Language Models
Large-scale endeavors like and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, they require hand-designed skills, and are limited to interactions with few object instances. We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method can operate in real-world environments without any privileged state information, hand-designed skills, and can manipulate any static object. We evaluate our method using two setups. First, Manipulate-Anything successfully generates trajectories for all 7 real-world and 14 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything's demonstrations can train more robust behavior cloning policies than training with human demonstrations, or from data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe Manipulate-Anything can be a scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Project page: https://robot-ma.github.io/.
comment: Project page: https://robot-ma.github.io/. All supplementary material, prompts and code can be found on the project page
♻ ☆ Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
Visual storytelling consists in generating a natural language story given a temporally ordered sequence of images. This task is not only challenging for models, but also very difficult to evaluate with automatic metrics since there is no consensus about what makes a story 'good'. In this paper, we introduce a novel method that measures story quality in terms of human likeness regarding three key aspects highlighted in previous work: visual grounding, coherence, and repetitiveness. We then use this method to evaluate the stories generated by several models, showing that the foundation model LLaVA obtains the best result, but only slightly so compared to TAPM, a 50-times smaller visual storytelling model. Upgrading the visual and language components of TAPM results in a model that yields competitive performance with a relatively low number of parameters. Finally, we carry out a human evaluation study, whose results suggest that a 'good' story may require more than a human-like level of visual grounding, coherence, and repetition.
♻ ☆ Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent Codes
Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g. basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they encounter difficulties in achieving an optimal balance between sampling fidelity and diversity. We address this challenge by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a discrete latent space to tackle the issue of posterior collapse. Specifically, we introduce an instance-based codebook that allows tailored latent representations for each example. In a nutshell, the rows of the codebook are dynamically adjusted to reflect contextual information (i.e., past motion patterns extracted from the observed trajectories). In this way, the discretization process gains flexibility, leading to improved reconstructions. Notably, instance-level dynamics are injected into the codebook through low-rank updates, which restrict the customization of the codebook to a lower dimension space. The resulting discrete space serves as the basis of the subsequent step, which regards the training of a diffusion-based predictive model. We show that such a two-fold framework, augmented with instance-level discretization, leads to accurate and diverse forecasts, yielding state-of-the-art performance on three established benchmarks.
comment: 15 pages, 3 figures, 5 tables
♻ ☆ Verification of Geometric Robustness of Neural Networks via Piecewise Linear Approximation and Lipschitz Optimisation ECAI 2024
We address the problem of verifying neural networks against geometric transformations of the input image, including rotation, scaling, shearing, and translation. The proposed method computes provably sound piecewise linear constraints for the pixel values by using sampling and linear approximations in combination with branch-and-bound Lipschitz optimisation. The method obtains provably tighter over-approximations of the perturbation region than the present state-of-the-art. We report results from experiments on a comprehensive set of verification benchmarks on MNIST and CIFAR10. We show that our proposed implementation resolves up to 32% more verification cases than present approaches.
comment: ECAI 2024
♻ ☆ Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express CIKM 2024
As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of AB tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70\%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.
comment: CIKM 2024 (International Conference on Information and Knowledge Management), Multimodal Search and Recommendations Workshop
♻ ☆ On the Efficacy of Text-Based Input Modalities for Action Anticipation
Anticipating future actions is a highly challenging task due to the diversity and scale of potential future actions; yet, information from different modalities help narrow down plausible action choices. Each modality can provide diverse and often complementary context for the model to learn from. While previous multi-modal methods leverage information from modalities such as video and audio, we primarily explore how text descriptions of actions and objects can also lead to more accurate action anticipation by providing additional contextual cues, e.g., about the environment and its contents. We propose a Multi-modal Contrastive Anticipative Transformer (M-CAT), a video transformer architecture that jointly learns from multi-modal features and text descriptions of actions and objects. We train our model in two stages, where the model first learns to align video clips with descriptions of future actions, and is subsequently fine-tuned to predict future actions. Compared to existing methods, M-CAT has the advantage of learning additional context from two types of text inputs: rich descriptions of future actions during pre-training, and, text descriptions for detected objects and actions during modality feature fusion. Through extensive experimental evaluation, we demonstrate that our model outperforms previous methods on the EpicKitchens datasets, and show that using simple text descriptions of actions and objects aid in more effective action anticipation. In addition, we examine the impact of object and action information obtained via text, and perform extensive ablations.
♻ ☆ Mumpy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection BMVC 2024
The task of video inpainting detection is to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies. However, these methods typically employ fixed operations to combine spatial and temporal clues, limiting their applicability in different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer ({\em MumPy}) that collaborates spatial-temporal clues flexibly. Our method utilizes a newly designed multilateral temporal-view encoder to extract various collaborations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module to enhance the diversity of these collaborations. Subsequently, we develop a multi-pyramid decoder to aggregate the various types of features and generate detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new challenging and large-scale Video Inpainting dataset based on the YouTube-VOS dataset, which employs several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.
comment: BMVC 2024
♻ ☆ Generalist Segmentation Algorithm for Photoreceptors Analysis in Adaptive Optics Imaging
Analyzing the cone photoreceptor pattern in images obtained from the living human retina using quantitative methods can be crucial for the early detection and management of various eye conditions. Confocal adaptive optics scanning light ophthalmoscope (AOSLO) imaging enables visualization of the cones from reflections of waveguiding cone photoreceptors. While there have been significant improvements in automated algorithms for segmenting cones in confocal AOSLO images, the process of labelling data remains labor-intensive and manual. This paper introduces a method based on deep learning (DL) for detecting and segmenting cones in AOSLO images. The models were trained on a semi-automatically labelled dataset of 20 AOSLO batches of images of 18 participants for 0$^{\circ}$, 1$^{\circ}$, and 2$^{\circ}$ from the foveal center. F1 scores were 0.968, 0.958, and 0.954 for 0$^{\circ}$, 1$^{\circ}$, and 2$^{\circ}$, respectively, which is better than previously reported DL approaches. Our method minimizes the need for labelled data by only necessitating a fraction of labelled cones, which is especially beneficial in the field of ophthalmology, where labelled data can often be limited.
♻ ☆ Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution
Recent progress in remote sensing image (RSI) super-resolution (SR) has exhibited remarkable performance using deep neural networks, e.g., Convolutional Neural Networks and Transformers. However, existing SR methods often suffer from either a limited receptive field or quadratic computational overhead, resulting in sub-optimal global representation and unacceptable computational costs in large-scale RSI. To alleviate these issues, we develop the first attempt to integrate the Vision State Space Model (Mamba) for RSI-SR, which specializes in processing large-scale RSI by capturing long-range dependency with linear complexity. To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR, to explore the spatial and frequent correlations. In particular, our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM) to grasp their merits for effective spatial-frequency fusion. Considering that global and local dependencies are complementary and both beneficial for SR, we further recalibrate these multi-level features for accurate feature fusion via learnable scaling adaptors. Extensive experiments on AID, DOTA, and DIOR benchmarks demonstrate that our FMSR outperforms state-of-the-art Transformer-based methods HAT-L in terms of PSNR by 0.11 dB on average, while consuming only 28.05% and 19.08% of its memory consumption and complexity, respectively. Code will be available at https://github.com/XY-boy/FreMamba
comment: Accepted by IEEE TMM
♻ ☆ On Feasibility of Intent Obfuscating Attacks
Intent obfuscation is a common tactic in adversarial situations, enabling the attacker to both manipulate the target system and avoid culpability. Surprisingly, it has rarely been implemented in adversarial attacks on machine learning systems. We are the first to propose using intent obfuscation to generate adversarial examples for object detectors: by perturbing another non-overlapping object to disrupt the target object, the attacker hides their intended target. We conduct a randomized experiment on 5 prominent detectors -- YOLOv3, SSD, RetinaNet, Faster R-CNN, and Cascade R-CNN -- using both targeted and untargeted attacks and achieve success on all models and attacks. We analyze the success factors characterizing intent obfuscating attacks, including target object confidence and perturb object sizes. We then demonstrate that the attacker can exploit these success factors to increase success rates for all models and attacks. Finally, we discuss main takeaways and legal repercussions.
comment: 33 pages, 21 Figures. Includes technical appendix. To appear in AIES 2024
♻ ☆ VideoMambaPro: A Leap Forward for Mamba in Video Understanding
Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative for transformers. However, Mamba's successes do not trivially extend to computer vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP) that solves the identified limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. VideoMambaPro shows state-of-the-art video action recognition performance compared to transformer models, and surpasses VideoMamba by clear margins: 7.9% and 8.1% top-1 on Kinetics-400 and Something-Something V2, respectively. Our VideoMambaPro-M model achieves 91.9% top-1 on Kinetics-400, only 0.2% below InternVideo2-6B but with only 1.2% of its parameters. The combination of high performance and efficiency makes VideoMambaPro an interesting alternative for transformer models.
comment: Model weights are lost due to management error, will re-calculate and update the results
♻ ☆ Learning to Detect and Segment for Open Vocabulary Object Detection CVPR2023
Open vocabulary object detection has been greatly advanced by the recent development of vision-language pretrained model, which helps recognize novel objects with only semantic categories. The prior works mainly focus on knowledge transferring to the object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design to better generalize the box regression and mask segmentation for open vocabulary setting. The core idea is to conditionally parameterize the network heads on semantic embedding and thus the model is guided with class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads, the dynamically aggregated head and the dynamically generated head. The former is instantiated with a set of static heads that are conditionally aggregated, these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such a conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to the state-of-the-art open vocabulary object detection methods with very minor overhead, e.g., it surpasses a RegionClip model by 3.0 detection AP on novel categories, with only 1.1% more computation.
comment: Accepted to CVPR2023, code will be available later
♻ ☆ Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization
Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging research hot spot in machine learning, which still suffers from low efficiency and poor quality. In this paper, we propose an End-to-End Efficient and Effective network for fast and accurate T3D face generation and manipulation, termed $E^3$-FaceNet. Different from existing complex generation paradigms, $E^3$-FaceNet resorts to a direct mapping from text instructions to 3D-aware visual space. We introduce a novel Style Code Enhancer to enhance cross-modal semantic alignment, alongside an innovative Geometric Regularization objective to maintain consistency across multi-view generations. Extensive experiments on three benchmark datasets demonstrate that $E^3$-FaceNet can not only achieve picture-like 3D face generation and manipulation, but also improve inference speed by orders of magnitudes. For instance, compared with Latent3D, $E^3$-FaceNet speeds up the five-view generations by almost 470 times, while still exceeding in generation quality. Our code is released at https://github.com/Aria-Zhangjl/E3-FaceNet.
♻ ☆ HDRTransDC: High Dynamic Range Image Reconstruction with Transformer Deformation Convolution
High Dynamic Range (HDR) imaging aims to generate an artifact-free HDR image with realistic details by fusing multi-exposure Low Dynamic Range (LDR) images. Caused by large motion and severe under-/over-exposure among input LDR images, HDR imaging suffers from ghosting artifacts and fusion distortions. To address these critical issues, we propose an HDR Transformer Deformation Convolution (HDRTransDC) network to generate high-quality HDR images, which consists of the Transformer Deformable Convolution Alignment Module (TDCAM) and the Dynamic Weight Fusion Block (DWFB). To solve the ghosting artifacts, the proposed TDCAM extracts long-distance content similar to the reference feature in the entire non-reference features, which can accurately remove misalignment and fill the content occluded by moving objects. For the purpose of eliminating fusion distortions, we propose DWFB to spatially adaptively select useful information across frames to effectively fuse multi-exposed features. Extensive experiments show that our method quantitatively and qualitatively achieves state-of-the-art performance.
comment: We request to withdraw our manuscript due to identified issues: inaccuracies in the description of a submodule's composition, principles, and functionality in Section 3.2, and potential problems in metric calculation in Sections 4.2 and 4.3. To prevent the spread of misleading information, we believe it is necessary to temporarily withdraw the manuscript for further research and verification
♻ ☆ A comparison between humans and AI at recognizing objects in unusual poses
Deep learning is closing the gap with human vision on several object recognition benchmarks. Here we investigate this gap for challenging images where objects are seen in unusual poses. We find that humans excel at recognizing objects in such poses. In contrast, state-of-the-art deep networks for vision (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) and state-of-the-art large vision-language models (Claude 3.5, Gemini 1.5, GPT-4) are systematically brittle on unusual poses, with the exception of Gemini showing excellent robustness in that condition. As we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) are necessary to identify objects in unusual poses. An analysis of error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feed-forward deep networks. In conclusion, our comparison reveals that humans and deep networks rely on different mechanisms for recognizing objects in unusual poses. Understanding the nature of the mental processes taking place during extra viewing time may be key to reproduce the robustness of human vision in silico.
♻ ☆ Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference
Parameter-efficient fine-tuning (PEFT) has emerged as a popular solution for adapting pre-trained Vision Transformer (ViT) models to downstream applications. While current PEFT methods have achieved parameter efficiency, they overlook the efficiency of computation and GPU memory during both fine-tuning and inference, falling short of practical requirements. In this paper, we propose \textbf{Sparse-Tuning}, a novel PEFT method that accounts for the information redundancy in images and videos to boost the above efficiency. By sparsely preserving the semantic-relevant tokens and merging irrelevant ones, Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead. To align our token sparsification strategy suitably with fine-tuning purposes, we further design Dense Adapters that establish dense connections from shallow layers to deeper layers. These Dense Adapters integrate multi-level local features to enrich the current tokens, improving both token preservation and model adaptation. Empirical results on VTAB-1K, three image datasets, and two video datasets show that our Sparse-Tuning reduces GFLOPs to \textbf{62\%-70\%} of the original ViT-B while achieving state-of-the-art performance. Source code is available at \url{https://github.com/liuting20/Sparse-Tuning}.
♻ ☆ Deep Learning Based Speckle Filtering for Polarimetric SAR Images. Application to Sentinel-1
Speckle suppression in synthetic aperture radar (SAR) images is a key processing step which continues to be a research topic. A wide variety of methods, using either spatially-based approaches or transform-based strategies, have been developed and have shown to provide outstanding results. However, recent advances in deep learning techniques and their application to SAR image despeckling have been demonstrated to offer state-of-the-art results. Unfortunately, they have been mostly applied to single-polarimetric images. The extension of a deep learning-based approach for speckle removal to polarimetric SAR (PolSAR) images is complicated because of the complex nature of the measured covariance matrices for every image pixel, the properties of which must be preserved during filtering. In this work, we propose a complete framework to remove speckle in polarimetric SAR images using a convolutional neural network. The methodology includes a reversible transformation of the original complex covariance matrix to obtain a set of real-valued intensity bands which are fed to the neural network. In addition, the proposed method includes a change detection strategy to avoid the neural network to learn erroneous features in areas strongly affected by temporal changes, so that the network only learns the underlying speckle component present in the data. The method is implemented and tested with dual-polarimetric images acquired by Sentinel-1. Experiments show that the proposed approach offers exceptional results in both speckle reduction and resolution preservation. More importantly, it is also shown that the neural network is not generating artifacts or introducing bias in the filtered images, making them suitable for further polarimetric processing and exploitation.
comment: 23 pages, 32 figures
♻ ☆ Dual-Domain CLIP-Assisted Residual Optimization Perception Model for Metal Artifact Reduction
Metal artifacts in computed tomography (CT) imaging pose significant challenges to accurate clinical diagnosis. The presence of high-density metallic implants results in artifacts that deteriorate image quality, manifesting in the forms of streaking, blurring, or beam hardening effects, etc. Nowadays, various deep learning-based approaches, particularly generative models, have been proposed for metal artifact reduction (MAR). However, these methods have limited perception ability in the diverse morphologies of different metal implants with artifacts, which may generate spurious anatomical structures and exhibit inferior generalization capability. To address the issues, we leverage visual-language model (VLM) to identify these morphological features and introduce them into a dual-domain CLIP-assisted residual optimization perception model (DuDoCROP) for MAR. Specifically, a dual-domain CLIP (DuDoCLIP) is fine-tuned on the image domain and sinogram domain using contrastive learning to extract semantic descriptions from anatomical structures and metal artifacts. Subsequently, a diffusion model is guided by the embeddings of DuDoCLIP, thereby enabling the dual-domain prior generation. Additionally, we design prompt engineering for more precise image-text descriptions that can enhance the model's perception capability. Then, a downstream task is devised for the one-step residual optimization and integration of dual-domain priors, while incorporating raw data fidelity. Ultimately, a new perceptual indicator is proposed to validate the model's perception and generation performance. With the assistance of DuDoCLIP, our DuDoCROP exhibits at least 63.7% higher generalization capability compared to the baseline model. Numerical experiments demonstrate that the proposed method can generate more realistic image structures and outperform other SOTA approaches both qualitatively and quantitatively.
comment: 14 pages, 18 figures
♻ ☆ Enhancing Adaptive Deep Networks for Image Classification via Uncertainty-aware Decision Fusion
Handling varying computational resources is a critical issue in modern AI applications. Adaptive deep networks, featuring the dynamic employment of multiple classifier heads among different layers, have been proposed to address classification tasks under varying computing resources. Existing approaches typically utilize the last classifier supported by the available resources for inference, as they believe that the last classifier always performs better across all classes. However, our findings indicate that earlier classifier heads can outperform the last head for certain classes. Based on this observation, we introduce the Collaborative Decision Making (CDM) module, which fuses the multiple classifier heads to enhance the inference performance of adaptive deep networks. CDM incorporates an uncertainty-aware fusion method based on evidential deep learning (EDL), that utilizes the reliability (uncertainty values) from the first c-1 classifiers to improve the c-th classifier' accuracy. We also design a balance term that reduces fusion saturation and unfairness issues caused by EDL constraints to improve the fusion quality of CDM. Finally, a regularized training strategy that uses the last classifier to guide the learning process of early classifiers is proposed to further enhance the CDM module's effect, called the Guided Collaborative Decision Making (GCDM) framework. The experimental evaluation demonstrates the effectiveness of our approaches. Results on ImageNet datasets show CDM and GCDM obtain 0.4% to 2.8% accuracy improvement (under varying computing resources) on popular adaptive networks. The code is available at the link https://github.com/Meteor-Stars/GCDM_AdaptiveNet.
comment: 13 pages, 27 figures. In ACM Multimedia 2024
♻ ☆ A Flying Bird Object Detection Method for Surveillance Video
Aiming at the specific characteristics of flying bird objects in surveillance video, such as the typically non-obvious features in single-frame images, small size in most instances, and asymmetric shapes, this paper proposes a Flying Bird Object Detection method for Surveillance Video (FBOD-SV). Firstly, a new feature aggregation module, the Correlation Attention Feature Aggregation (Co-Attention-FA) module, is designed to aggregate the features of the flying bird object according to the bird object's correlation on multiple consecutive frames of images. Secondly, a Flying Bird Object Detection Network (FBOD-Net) with down-sampling followed by up-sampling is designed, which utilizes a large feature layer that fuses fine spatial information and large receptive field information to detect special multi-scale (mostly small-scale) bird objects. Finally, the SimOTA dynamic label allocation method is applied to One-Category object detection, and the SimOTA-OC dynamic label strategy is proposed to solve the difficult problem of label allocation caused by irregular flying bird objects. In this paper, the performance of the FBOD-SV is validated using experimental datasets of flying bird objects in traction substation surveillance videos. The experimental results show that the FBOD-SV effectively improves the detection performance of flying bird objects in surveillance video.
♻ ☆ Conformal Performance Range Prediction for Segmentation Output Quality Control MICCAI
Recent works have introduced methods to estimate segmentation performance without ground truth, relying solely on neural network softmax outputs. These techniques hold potential for intuitive output quality control. However, such performance estimates rely on calibrated softmax outputs, which is often not the case in modern neural networks. Moreover, the estimates do not take into account inherent uncertainty in segmentation tasks. These limitations may render precise performance predictions unattainable, restricting the practical applicability of performance estimation methods. To address these challenges, we develop a novel approach for predicting performance ranges with statistical guarantees of containing the ground truth with a user specified probability. Our method leverages sampling-based segmentation uncertainty estimation to derive heuristic performance ranges, and applies split conformal prediction to transform these estimates into rigorous prediction ranges that meet the desired guarantees. We demonstrate our approach on the FIVES retinal vessel segmentation dataset and compare five commonly used sampling-based uncertainty estimation techniques. Our results show that it is possible to achieve the desired coverage with small prediction ranges, highlighting the potential of performance range prediction as a valuable tool for output quality control.
comment: Accepted as an oral presentation at MICCAI UNSURE 2024
♻ ☆ Shot Segmentation Based on Von Neumann Entropy for Key Frame Extraction
Video key frame extraction is important in various fields, such as video summary, retrieval, and compression. Therefore, we suggest a video key frame extraction algorithm based on shot segmentation using Von Neumann entropy. The segmentation of shots is achieved through the computation of Von Neumann entropy of the similarity matrix among frames within the video sequence. The initial frame of each shot is selected as key frames, which combines the temporal sequence information of frames. The experimental results show the extracted key frames can fully and accurately represent the original video content while minimizing the number of repeated frames.
comment: 14 pages, 5 figures
♻ ☆ Spatio-Temporal Context Prompting for Zero-Shot Action Detection
Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person's interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found in https://webber2933.github.io/ST-CLIP-project-page.
comment: Project page: https://webber2933.github.io/ST-CLIP-project-page
♻ ☆ Text-Region Matching for Multi-Label Image Recognition with Missing Labels ACM MM
Recently, large-scale visual language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, they usually cannot match text and vision features well, due to complicated semantics gaps and missing labels in a multi-label image. To tackle this challenge, we propose $\textbf{T}$ext-$\textbf{R}$egion $\textbf{M}$atching for optimizing $\textbf{M}$ulti-$\textbf{L}$abel prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. Compared to existing methods, we advocate exploring the information of category-aware regions rather than the entire image or pixels, which contributes to bridging the semantic gap between textual and visual representations in a one-to-one matching manner. Concurrently, we further introduce multimodal contrastive learning to narrow the semantic gap between textual and visual modalities and establish intra-class and inter-class relationships. Additionally, to deal with missing labels, we propose a multimodal category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels, facilitating pseudo-label generation. Extensive experiments on the MS-COCO, PASCAL VOC, Visual Genome, NUS-WIDE, and CUB-200-211 benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art methods by a significant margin. Our code is available here: https://github.com/yu-gi-oh-leilei/TRM-ML.
comment: Accepted to ACM International Conference on Multimedia (ACM MM) 2024
♻ ☆ Erasing Concepts from Text-to-Image Diffusion Models with Few-shot Unlearning BMVC2024
Generating images from text has become easier because of the scaling of diffusion models and advancements in the field of vision and language. These models are trained using vast amounts of data from the Internet. Hence, they often contain undesirable content such as copyrighted material. As it is challenging to remove such data and retrain the models, methods for erasing specific concepts from pre-trained models have been investigated. We propose a novel concept-erasure method that updates the text encoder using few-shot unlearning in which a few real images are used. The discussion regarding the generated images after erasing a concept has been lacking. While there are methods for specifying the transition destination for concepts, the validity of the specified concepts is unclear. Our method implicitly achieves this by transitioning to the latent concepts inherent in the model or the images. Our method can erase a concept within 10 s, making concept erasure more accessible than ever before. Implicitly transitioning to related concepts leads to more natural concept erasure. We applied the proposed method to various concepts and confirmed that concept erasure can be achieved tens to hundreds of times faster than with current methods. By varying the parameters to be updated, we obtained results suggesting that, like previous research, knowledge is primarily accumulated in the feed-forward networks of the text encoder. Our code is available at \url{https://github.com/fmp453/few-shot-erasing}
comment: 25 pages, 28 figures, accepted by BMVC2024
♻ ☆ CorMulT: A Semi-supervised Modality Correlation-aware Multimodal Transformer for Sentiment Analysis
Multimodal sentiment analysis is an active research area that combines multiple data modalities, e.g., text, image and audio, to analyze human emotions and benefits a variety of applications. Existing multimodal sentiment analysis methods can be classified as modality interaction-based methods, modality transformation-based methods and modality similarity-based methods. However, most of these methods highly rely on the strong correlations between modalities, and cannot fully uncover and utilize the correlations between modalities to enhance sentiment analysis. Therefore, these methods usually achieve bad performance for identifying the sentiment of multimodal data with weak correlations. To address this issue, we proposed a two-stage semi-supervised model termed Correlation-aware Multimodal Transformer (CorMulT) which consists pre-training stage and prediction stage. At the pre-training stage, a modality correlation contrastive learning module is designed to efficiently learn modality correlation coefficients between different modalities. At the prediction stage, the learned correlation coefficients are fused with modality representations to make the sentiment prediction. According to the experiments on the popular multimodal dataset CMU-MOSEI, CorMulT obviously surpasses state-of-the-art multimodal sentiment analysis methods.
♻ ☆ KeyMatchNet: Zero-Shot Pose Estimation in 3D Point Clouds by Generalized Keypoint Matching
In this paper, we present KeyMatchNet, a novel network for zero-shot pose estimation in 3D point clouds. Our method uses only depth information, making it more applicable for many industrial use cases, as color information is seldom available. The network is composed of two parallel components for computing object and scene features. The features are then combined to create matches used for pose estimation. The parallel structure allows for pre-processing of the individual parts, which decreases the run-time. Using a zero-shot network allows for a very short set-up time, as it is not necessary to train models for new objects. However, as the network is not trained for the specific object, zero-shot pose estimation methods generally have lower accuracy compared with conventional methods. To address this, we reduce the complexity of the task by including the scenario information during training. This is typically not feasible as collecting real data for new tasks drastically increases the cost. However, for zero-shot pose estimation, training for new objects is not necessary and the expensive data collection can thus be performed only once. Our method is trained on 1,500 objects and is only tested on unseen objects. We demonstrate that the trained network can not only accurately estimate poses for novel objects, but also demonstrate the ability of the network on objects outside of the trained class. Test results are also shown on real data. We believe that the presented method is valuable for many real-world scenarios. Project page available at keymatchnet.github.io
comment: 8 pages, 6 figures, 5 tables
♻ ☆ Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation for Global Solar Mapping
The transition to renewable energy, particularly solar, is key to mitigating climate change. Google's Solar API aids this transition by estimating solar potential from aerial imagery, but its impact is constrained by geographical coverage. This paper proposes expanding the API's reach using satellite imagery, enabling global solar potential assessment. We tackle challenges involved in building a Digital Surface Model (DSM) and roof instance segmentation from lower resolution and single oblique views using deep learning models. Our models, trained on aligned satellite and aerial datasets, produce 25cm DSMs and roof segments. With ~1m DSM MAE on buildings, ~5deg roof pitch error and ~56% IOU on roof segmentation, they significantly enhance the Solar API's potential to promote solar adoption.
comment: 14 pages
♻ ☆ RIDE: Boosting 3D Object Detection for LiDAR Point Clouds via Rotation-Invariant Analysis
The rotation robustness property has drawn much attention to point cloud analysis, whereas it still poses a critical challenge in 3D object detection. When subjected to arbitrary rotation, most existing detectors fail to produce expected outputs due to the poor rotation robustness. In this paper, we present RIDE, a pioneering exploration of Rotation-Invariance for the 3D LiDAR-point-based object DEtector, with the key idea of designing rotation-invariant features from LiDAR scenes and then effectively incorporating them into existing 3D detectors. Specifically, we design a bi-feature extractor that extracts (i) object-aware features though sensitive to rotation but preserve geometry well, and (ii) rotation-invariant features, which lose geometric information to a certain extent but are robust to rotation. These two kinds of features complement each other to decode 3D proposals that are robust to arbitrary rotations. Particularly, our RIDE is compatible and easy to plug into the existing one-stage and two-stage 3D detectors, and boosts both detection performance and rotation robustness. Extensive experiments on the standard benchmarks showcase that the mean average precision (mAP) and rotation robustness can be significantly boosted by integrating with our RIDE, with +5.6% mAP and 53% rotation robustness improvement on KITTI, +5.1% and 28% improvement correspondingly on nuScenes. The code will be available soon.
♻ ☆ Content Significance Distribution of Sub-Text Blocks in Articles and Its Application to Article-Organization Assessment
We explore how to capture the significance of a sub-text block in an article and how it may be used for text mining tasks. A sub-text block is a sub-sequence of sentences in the article. We formulate the notion of content significance distribution (CSD) of sub-text blocks, referred to as CSD of the first kind and denoted by CSD-1. In particular, we leverage Hugging Face's SentenceTransformer to generate contextual sentence embeddings, and use MoverScore over text embeddings to measure how similar a sub-text block is to the entire text. To overcome the exponential blowup on the number of sub-text blocks, we present an approximation algorithm and show that the approximated CSD-1 is almost identical to the exact CSD-1. Under this approximation, we show that the average and median CSD-1's for news, scholarly research, argument, and narrative articles share the same pattern. We also show that under a certain linear transformation, the complement of the cumulative distribution function of the beta distribution with certain values of $\alpha$ and $\beta$ resembles a CSD-1 curve. We then use CSD-1's to extract linguistic features to train an SVC classifier for assessing how well an article is organized. Through experiments, we show that this method achieves high accuracy for assessing student essays. Moreover, we study CSD of sentence locations, referred to as CSD of the second kind and denoted by CSD-2, and show that average CSD-2's for different types of articles possess distinctive patterns, which either conform common perceptions of article structures or provide rectification with minor deviation.
♻ ☆ SegVol: Universal and Interactive Volumetric Medical Image Segmentation
Precise image segmentation provides clinical study with instructive information. Despite the remarkable progress achieved in medical image segmentation, there is still an absence of a 3D foundation segmentation model that can segment a wide range of anatomical categories with easy user interaction. In this paper, we propose a 3D foundation segmentation model, named SegVol, supporting universal and interactive volumetric medical image segmentation. By scaling up training data to 90K unlabeled Computed Tomography (CT) volumes and 6K labeled CT volumes, this foundation model supports the segmentation of over 200 anatomical categories using semantic and spatial prompts. To facilitate efficient and precise inference on volumetric images, we design a zoom-out-zoom-in mechanism. Extensive experiments on 22 anatomical segmentation tasks verify that SegVol outperforms the competitors in 19 tasks, with improvements up to 37.24% compared to the runner-up methods. We demonstrate the effectiveness and importance of specific designs by ablation study. We expect this foundation model can promote the development of volumetric medical image analysis. The model and code are publicly available at: https://github.com/BAAI-DCAI/SegVol.
♻ ☆ DiffiT: Diffusion Vision Transformers for Image Generation ECCV'24
Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities and scalability, especially for recognition tasks. In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). Specifically, we propose a methodology for finegrained control of the denoising process and introduce the Time-dependant Multihead Self Attention (TMSA) mechanism. DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency. We also propose latent and image space DiffiT models and show SOTA performance on a variety of class-conditional and unconditional synthesis tasks at different resolutions. The Latent DiffiT model achieves a new SOTA FID score of 1.73 on ImageNet256 dataset while having 19.85%, 16.88% less parameters than other Transformer-based diffusion models such as MDT and DiT,respectively. Code: https://github.com/NVlabs/DiffiT
comment: Accepted to ECCV'24
♻ ☆ EaDeblur-GS: Event assisted 3D Deblur Reconstruction with Gaussian Splatting
3D deblurring reconstruction techniques have recently seen significant advancements with the development of Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Although these techniques can recover relatively clear 3D reconstructions from blurry image inputs, they still face limitations in handling severe blurring and complex camera motion. To address these issues, we propose Event-assisted 3D Deblur Reconstruction with Gaussian Splatting (EaDeblur-GS), which integrates event camera data to enhance the robustness of 3DGS against motion blur. By employing an Adaptive Deviation Estimator (ADE) network to estimate Gaussian center deviations and using novel loss functions, EaDeblur-GS achieves sharp 3D reconstructions in real-time, demonstrating performance comparable to state-of-the-art methods.
♻ ☆ Pre-training on Synthetic Driving Data for Trajectory Prediction
Accumulating substantial volumes of real-world driving data proves pivotal in the realm of trajectory forecasting for autonomous driving. Given the heavy reliance of current trajectory forecasting models on data-driven methodologies, we aim to tackle the challenge of learning general trajectory forecasting representations under limited data availability. We propose a pipeline-level solution to mitigate the issue of data scarcity in trajectory forecasting. The solution is composed of two parts: firstly, we adopt HD map augmentation and trajectory synthesis for generating driving data, and then we learn representations by pre-training on them. Specifically, we apply vector transformations to reshape the maps, and then employ a rule-based model to generate trajectories on both original and augmented scenes; thus enlarging the driving data without collecting additional real ones. To foster the learning of general representations within this augmented dataset, we comprehensively explore the different pre-training strategies, including extending the concept of a Masked AutoEncoder (MAE) for trajectory forecasting. Without bells and whistles, our proposed pipeline-level solution is general, simple, yet effective: we conduct extensive experiments to demonstrate the effectiveness of our data expansion and pre-training strategies, which outperform the baseline prediction model by large margins, e.g. 5.04%, 3.84% and 8.30% in terms of $MR_6$, $minADE_6$ and $minFDE_6$. The pre-training dataset and the codes for pre-training and fine-tuning are released at https://github.com/yhli123/Pretraining_on_Synthetic_Driving_Data_for_Trajectory_Prediction.
♻ ☆ Improving Deep Representation Learning via Auxiliary Learnable Target Coding
Deep representation learning is a subfield of machine learning that focuses on learning meaningful and useful representations of data through deep neural networks. However, existing methods for semantic classification typically employ pre-defined target codes such as the one-hot and the Hadamard codes, which can either fail or be less flexible to model inter-class correlation. In light of this, this paper introduces a novel learnable target coding as an auxiliary regularization of deep representation learning, which can not only incorporate latent dependency across classes but also impose geometric properties of target codes into representation space. Specifically, a margin-based triplet loss and a correlation consistency loss on the proposed target codes are designed to encourage more discriminative representations owing to enlarging between-class margins in representation space and favoring equal semantic correlation of learnable target codes respectively. Experimental results on several popular visual classification and retrieval benchmarks can demonstrate the effectiveness of our method on improving representation learning, especially for imbalanced data. Source codes are made publicly available at \href{https://github.com/AkonLau/LTC}{https://github.com/AkonLau/LTC}.
comment: Accepted by Pattern Recognition, 33 pages, 8 figures, 11 tables
♻ ☆ SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization
Extreme Multimodal Summarization with Multimodal Output (XMSMO) becomes an attractive summarization approach by integrating various types of information to create extremely concise yet informative summaries for individual modalities. Existing methods overlook the issue that multimodal data often contains more topic irrelevant information, which can mislead the model into producing inaccurate summaries especially for extremely short ones. In this paper, we propose SITransformer, a Shared Information-guided Transformer for extreme multimodal summarization. It has a shared information guided pipeline which involves a cross-modal shared information extractor and a cross-modal interaction module. The extractor formulates semantically shared salient information from different modalities by devising a novel filtering process consisting of a differentiable top-k selector and a shared-information guided gating unit. As a result, the common, salient, and relevant contents across modalities are identified. Next, a transformer with cross-modal attentions is developed for intra- and inter-modality learning with the shared information guidance to produce the extreme summary. Comprehensive experiments demonstrate that SITransformer significantly enhances the summarization quality for both video and text summaries for XMSMO. Our code will be publicly available at https://github.com/SichengLeoLiu/MMAsia24-XMSMO.
comment: 8 pages, 5 figures, submitted to ACM Multimedia Asia 2024
♻ ☆ Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
Most Vision-and-Language Navigation (VLN) algorithms are prone to making decision due to a lack of visual common sense and insufficient reasoning capabilities. To address this issue, we propose a Hierarchical Spatial Proximity Reasoning (HSPR) method. First, we introduce a scene understanding auxiliary task to help the agent build a knowledge base of hierarchical spatial proximity. This task utilizes panoramic views and object features to identify types of nodes and uncover the adjacency relationships between nodes, objects, and between nodes and objects. Second, we propose a multi-step reasoning navigation algorithm based on hierarchical spatial proximity knowledge base, which continuously plans feasible paths to enhance exploration efficiency. Third, we introduce a residual fusion method to improve navigation decision accuracy. Finally, we validate our approach with experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R. Our code is available at https://github.com/iCityLab/HSPR.
♻ ☆ DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection
Recent advances in multi-view camera-only 3D object detection either rely on an accurate reconstruction of bird's-eye-view (BEV) 3D features or on traditional 2D perspective view (PV) image features. While both have their own pros and cons, few have found a way to stitch them together in order to benefit from "the best of both worlds". To this end, we explore a duo space (i.e., BEV and PV) 3D perception framework, in conjunction with some useful duo space fusion strategies that allow effective aggregation of the two feature representations. To the best of our knowledge, our proposed method, DuoSpaceNet, is the first to leverage two distinct feature spaces and achieves the state-of-the-art 3D object detection and BEV map segmentation results on nuScenes dataset.
♻ ☆ TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within A Joint Learning Framework
Semantic segmentation and stereo matching, respectively analogous to the ventral and dorsal streams in our human brain, are two key components of autonomous driving perception systems. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, especially emphasizing feature sharing between the two tasks. The major contributions of this study lie in comprehensively tightening the coupling between semantic segmentation and stereo matching. Specifically, this study introduces three novelties: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI and vKITTI2 datasets, along with qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function, and demonstrate its superior performance compared to prior arts, with a notable increase in mIoU by over 9%. Our source code will be publicly available at mias.group/TiCoSS upon publication.
♻ ☆ 360 Layout Estimation via Orthogonal Planes Disentanglement and Multi-view Geometric Consistency Perception
Existing panoramic layout estimation solutions tend to recover room boundaries from a vertically compressed sequence, yielding imprecise results as the compression process often muddles the semantics between various planes. Besides, these data-driven approaches impose an urgent demand for massive data annotations, which are laborious and time-consuming. For the first problem, we propose an orthogonal plane disentanglement network (termed DOPNet) to distinguish ambiguous semantics. DOPNet consists of three modules that are integrated to deliver distortion-free, semantics-clean, and detail-sharp disentangled representations, which benefit the subsequent layout recovery. For the second problem, we present an unsupervised adaptation technique tailored for horizon-depth and ratio representations. Concretely, we introduce an optimization strategy for decision-level layout analysis and a 1D cost volume construction method for feature-level multi-view aggregation, both of which are designed to fully exploit the geometric consistency across multiple perspectives. The optimizer provides a reliable set of pseudo-labels for network training, while the 1D cost volume enriches each view with comprehensive scene information derived from other perspectives. Extensive experiments demonstrate that our solution outperforms other SoTA models on both monocular layout estimation and multi-view layout estimation tasks. Cobe can be available at https://github.com/zhijieshen-bjtu/MV-DOPNet.
comment: Accept to TPAMI2024. arXiv admin note: substantial text overlap with arXiv:2303.00971
♻ ☆ Embodiment: Self-Supervised Depth Estimation Based on Camera Models
Depth estimation is a critical topic for robotics and vision-related tasks. In monocular depth estimation, in comparison with supervised learning that requires expensive ground truth labeling, self-supervised methods possess great potential due to no labeling cost. However, self-supervised learning still has a large gap with supervised learning in 3D reconstruction and depth estimation performance. Meanwhile, scaling is also a major issue for monocular unsupervised depth estimation, which commonly still needs ground truth scale from GPS, LiDAR, or existing maps to correct. In the era of deep learning, existing methods primarily rely on exploring image relationships to train unsupervised neural networks, while the physical properties of the camera itself such as intrinsics and extrinsics are often overlooked. These physical properties are not just mathematical parameters; they are embodiments of the camera's interaction with the physical world. By embedding these physical properties into the deep learning model, we can calculate depth priors for ground regions and regions connected to the ground based on physical principles, providing free supervision signals without the need for additional sensors. This approach is not only easy to implement but also enhances the effects of all unsupervised methods by embedding the camera's physical properties into the model, thereby achieving an embodied understanding of the real world.
♻ ☆ Using Texture to Classify Forests Separately from Vegetation
Identifying terrain within satellite image data is a key issue in geographical information sciences, with numerous environmental and safety implications. Many techniques exist to derive classifications from spectral data captured by satellites. However, the ability to reliably classify vegetation remains a challenge. In particular, no precise methods exist for classifying forest vs. non-forest vegetation in high-level satellite images. This paper provides an initial proposal for a static, algorithmic process to identify forest regions in satellite image data through texture features created from detected edges and the NDVI ratio captured by Sentinel-2 satellite images. With strong initial results, this paper also identifies the next steps to improve the accuracy of the classification and verification processes.
Information Retrieval
☆ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.
Transformers Meet ACT-R: Repeat-Aware and Sequential Listening Session Recommendation RecSys'2024
Music streaming services often leverage sequential recommender systems to predict the best music to showcase to users based on past sequences of listening sessions. Nonetheless, most sequential recommendation methods ignore or insufficiently account for repetitive behaviors. This is a crucial limitation for music recommendation, as repeatedly listening to the same song over time is a common phenomenon that can even change the way users perceive this song. In this paper, we introduce PISA (Psychology-Informed Session embedding using ACT-R), a session-level sequential recommender system that overcomes this limitation. PISA employs a Transformer architecture learning embedding representations of listening sessions and users using attention mechanisms inspired by Anderson's ACT-R (Adaptive Control of Thought-Rational), a cognitive architecture modeling human information access and memory dynamics. This approach enables us to capture dynamic and repetitive patterns from user behaviors, allowing us to effectively predict the songs they will listen to in subsequent sessions, whether they are repeated or new ones. We demonstrate the empirical relevance of PISA using both publicly available listening data from Last.fm and proprietary data from Deezer, a global music streaming service, confirming the critical importance of repetition modeling for sequential listening session recommendation. Along with this paper, we publicly release our proprietary dataset to foster future research in this field, as well as the source code of PISA to facilitate its future use.
comment: 11 pages. Accepted by RecSys'2024, full paper
☆ Is text normalization relevant for classifying medieval charters?
This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections
☆ Do Recommender Systems Promote Local Music? A Reproducibility Study Using Music Streaming Data
This paper examines the influence of recommender systems on local music representation, discussing prior findings from an empirical study on the LFM-2b public dataset. This prior study argued that different recommender systems exhibit algorithmic biases shifting music consumption either towards or against local content. However, LFM-2b users do not reflect the diverse audience of music streaming services. To assess the robustness of this study's conclusions, we conduct a comparative analysis using proprietary listening data from a global music streaming service, which we publicly release alongside this paper. We observe significant differences in local music consumption patterns between our dataset and LFM-2b, suggesting that caution should be exercised when drawing conclusions on local music based solely on LFM-2b. Moreover, we show that the algorithmic biases exhibited in the original work vary in our dataset, and that several unexplored model parameters can significantly influence these biases and affect the study's conclusion on both datasets. Finally, we discuss the complexity of accurately labeling local music, emphasizing the risk of misleading conclusions due to unreliable, biased, or incomplete labels. To encourage further research and ensure reproducibility, we have publicly shared our dataset and code.
☆ SynDL: A Large-Scale Synthetic Test Collection
Large-scale test collections play a crucial role in Information Retrieval (IR) research. However, according to the Cranfield paradigm and the research into publicly available datasets, the existing information retrieval research studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments - a time-intensive and expensive process. Recent studies have shown the strong capability of Large Language Models (LLMs) in producing reliable relevance judgments with human accuracy but at a greatly reduced cost. In this paper, to address the missing large-scale ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL) test collection via additional language model synthetic labels to enable researchers to test and evaluate their search systems at a large scale. Specifically, such a test collection includes more than 1,900 test queries from the previous years of tracks. We compare system evaluation with past human labels from past years and find that our synthetically created large-scale test collection can lead to highly correlated system rankings.
comment: 9 pages
☆ Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models ECCV 2024
In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling us to utilize efficient sparse retrieval algorithms employed in natural language processing for image retrieval tasks. To assist the LLM in extracting image features, we apply data augmentation techniques for key expansion and analyze the impact with a metric for relevance between images and textual data. We empirically show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods on the MS-COCO, PASCAL VOC, and NUS-WIDE datasets in a keyword-based image retrieval scenario, where keywords serve as search queries. We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
comment: Accepted to ECCV 2024 Workshops: 2nd Workshop on Traditional Computer Vision in the Age of Deep Learning (TradiCV)
☆ Efficient Transfer Learning Framework for Cross-Domain Click-Through Rate Prediction
Natural content and advertisement coexist in industrial recommendation systems but differ in data distribution. Concretely, traffic related to the advertisement is considerably sparser compared to that of natural content, which motivates the development of transferring knowledge from the richer source natural content domain to the sparser advertising domain. The challenges include the inefficiencies arising from the management of extensive source data and the problem of 'catastrophic forgetting' that results from the CTR model's daily updating. To this end, we propose a novel tri-level asynchronous framework, i.e., Efficient Transfer Learning Framework for Cross-Domain Click-Through Rate Prediction (E-CDCTR), to transfer comprehensive knowledge of natural content to advertisement CTR models. This framework consists of three key components: Tiny Pre-training Model ((TPM), which trains a tiny CTR model with several basic features on long-term natural data; Complete Pre-training Model (CPM), which trains a CTR model holding network structure and input features the same as target advertisement on short-term natural data; Advertisement CTR model (A-CTR), which derives its parameter initialization from CPM together with multiple historical embeddings from TPM as extra feature and then fine-tunes on advertisement data. TPM provides richer representations of user and item for both the CPM and A-CTR, effectively alleviating the forgetting problem inherent in the daily updates. CPM further enhances the advertisement model by providing knowledgeable initialization, thereby alleviating the data sparsity challenges typically encountered by advertising CTR models. Such a tri-level cross-domain transfer learning framework offers an efficient solution to address both data sparsity and `catastrophic forgetting', yielding remarkable improvements.
☆ A Prototype Model of Zero-Trust Architecture Blockchain with EigenTrust-Based Practical Byzantine Fault Tolerance Protocol to Manage Decentralized Clinical Trials
The COVID-19 pandemic necessitated the emergence of decentralized Clinical Trials (DCTs) due to patient retention, accelerate trials, improve data accessibility, enable virtual care, and facilitate seamless communication through integrated systems. However, integrating systems in DCTs exposes clinical data to potential security threats, making them susceptible to theft at any stage, a high risk of protocol deviations, and monitoring issues. To mitigate these challenges, blockchain technology serves as a secure framework, acting as a decentralized ledger, creating an immutable environment by establishing a zero-trust architecture, where data are deemed untrusted until verified. In combination with Internet of Things (IoT)-enabled wearable devices, blockchain secures the transfer of clinical trial data on private blockchains during DCT automation and operations. This paper proposes a prototype model of the Zero-Trust Architecture Blockchain (z-TAB) to integrate patient-generated clinical trial data during DCT operation management. The EigenTrust-based Practical Byzantine Fault Tolerance (T-PBFT) algorithm has been incorporated as a consensus protocol, leveraging Hyperledger Fabric. Furthermore, the Internet of Things (IoT) has been integrated to streamline data processing among stakeholders within the blockchain platforms. Rigorous evaluation has been done to evaluate the quality of the system.
comment: NA
☆ Longitudinal Modularity, a Modularity for Link Streams
Temporal networks are commonly used to model real-life phenomena. When these phenomena represent interactions and are captured at a fine-grained temporal resolution, they are modeled as link streams. Community detection is an essential network analysis task. Although many methods exist for static networks, and some methods have been developed for temporal networks represented as sequences of snapshots, few works can handle link streams. This article introduces the first adaptation of the well-known Modularity quality function to link streams. Unlike existing methods, it is independent of the time scale of analysis. After introducing the quality function, and its relation to existing static and dynamic definitions of Modularity, we show experimentally its relevance for dynamic community evaluation.
♻ ☆ Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express CIKM 2024
As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of AB tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (over 70\%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.
comment: CIKM 2024 (International Conference on Information and Knowledge Management), Multimodal Search and Recommendations Workshop
♻ ☆ GenRec: Generative Sequential Recommendation with Large Language Models
Sequential recommendation is a task to capture hidden user preferences from historical user item interaction data and recommend next items for the user. Significant progress has been made in this domain by leveraging classification based learning methods. Inspired by the recent paradigm of 'pretrain, prompt and predict' in NLP, we consider sequential recommendation as a sequence to sequence generation task and propose a novel model named Generative Recommendation (GenRec). Unlike classification based models that learn explicit user and item representations, GenRec utilizes the sequence modeling capability of Transformer and adopts the masked item prediction objective to effectively learn the hidden bidirectional sequential patterns. Different from existing generative sequential recommendation models, GenRec does not rely on manually designed hard prompts. The input to GenRec is textual user item sequence and the output is top ranked next items. Moreover, GenRec is lightweight and requires only a few hours to train effectively in low-resource settings, making it highly applicable to real-world scenarios and helping to democratize large language models in the sequential recommendation domain. Our extensive experiments have demonstrated that GenRec generalizes on various public real-world datasets and achieves state-of-the-art results. Our experiments also validate the effectiveness of the the proposed masked item prediction objective that improves the model performance by a large margin.
♻ ☆ Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system SC
Meetings play a critical infrastructural role in the coordination of work. In recent years, due to shift to hybrid and remote work, more meetings are moving to online Computer Mediated Spaces. This has led to new problems (e.g. more time spent in less engaging meetings) and new opportunities (e.g. automated transcription/captioning and recap support). Recent advances in large language models (LLMs) for dialog summarization have the potential to improve the experience of meetings by reducing individuals' meeting load and increasing the clarity and alignment of meeting outputs. Despite this potential, they face technological limitation due to long transcripts and inability to capture diverse recap needs based on user's context. To address these gaps, we design, implement and evaluate in-context a meeting recap system. We first conceptualize two salient recap representations -- important highlights, and a structured, hierarchical minutes view. We develop a system to operationalize the representations with dialogue summarization as its building blocks. Finally, we evaluate the effectiveness of the system with seven users in the context of their work meetings. Our findings show promise in using LLM-based dialogue summarization for meeting recap and the need for both representations in different contexts. However, we find that LLM-based recap still lacks an understanding of whats personally relevant to participants, can miss important details, and mis-attributions can be detrimental to group dynamics. We identify collaboration opportunities such as a shared recap document that a high quality recap enables. We report on implications for designing AI systems to partner with users to learn and improve from natural interactions to overcome the limitations related to personal relevance and summarization quality.
comment: in review for CSCW 24
♻ ☆ Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models
Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits for adherence to metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.5). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement to 97% from 79% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base
Machine Learning
☆ A Score-Based Density Formula, with Applications in Diffusion Generative Models
Score-based generative models (SGMs) have revolutionized the field of generative modeling, achieving unprecedented success in generating realistic and diverse content. Despite empirical advances, the theoretical basis for why optimizing the evidence lower bound (ELBO) on the log-likelihood is effective for training diffusion generative models, such as DDPMs, remains largely unexplored. In this paper, we address this question by establishing a density formula for a continuous-time diffusion process, which can be viewed as the continuous-time limit of the forward process in an SGM. This formula reveals the connection between the target density and the score function associated with each step of the forward process. Building on this, we demonstrate that the minimizer of the optimization objective for training DDPMs nearly coincides with that of the true objective, providing a theoretical foundation for optimizing DDPMs using the ELBO. Furthermore, we offer new insights into the role of score-matching regularization in training GANs, the use of ELBO in diffusion classifiers, and the recently proposed diffusion loss.
☆ UV-free Texture Generation with Denoising and Geodesic Heat Diffusions
Seams, distortions, wasted UV space, vertex-duplication, and varying resolution over the surface are the most prominent issues of the standard UV-based texturing of meshes. These issues are particularly acute when automatic UV-unwrapping techniques are used. For this reason, instead of generating textures in automatically generated UV-planes like most state-of-the-art methods, we propose to represent textures as coloured point-clouds whose colours are generated by a denoising diffusion probabilistic model constrained to operate on the surface of 3D objects. Our sampling and resolution agnostic generative model heavily relies on heat diffusion over the surface of the meshes for spatial communication between points. To enable processing of arbitrarily sampled point-cloud textures and ensure long-distance texture consistency we introduce a fast re-sampling of the mesh spectral properties used during the heat diffusion and introduce a novel heat-diffusion-based self-attention mechanism. Our code and pre-trained models are available at github.com/simofoti/UV3-TeD.
☆ Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models
Reinforcement learning is used to align language models with human preference signals after first pre-training the model to predict the next token of text within a large corpus using likelihood maximization. Before being deployed in a specific domain, models are often further fine-tuned on task specific data. Since human preferences are often unavailable for the last step, it is performed using likelihood maximization as that is the typical default method. However, reinforcement learning has other advantages besides facilitating alignment to a human derived reward function. For one, whereas likelihood maximization is a form of imitation learning in which the model is trained on what to do under ideal conditions, reinforcement learning is not limited to demonstrating actions just for optimally reached states and trains a model what to do under a range of scenarios as it explores the policy space. In addition, it also trains a model what not to do, suppressing competitive but poor actions. This work develops a framework for last-mile fine-tuning using reinforcement learning and tests whether it garners performance gains. The experiments center on abstractive summarization, but the framework is general and broadly applicable. Use of the procedure produced significantly better results than likelihood maximization when comparing raw predictions. For the specific data tested, the gap could be bridged by employing post-processing of the maximum likelihood outputs. Nonetheless, the framework offers a new avenue for model optimization in situations where post-processing may be less straightforward or effective, and it can be extended to include more complex classes of undesirable outputs to penalize and train against, such as hallucinations.
☆ A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models
Beyond maximum likelihood estimation (MLE), the standard objective of a language model (LM) that optimizes good examples probabilities, many studies have explored ways that also penalize bad examples for enhancing the quality of output distribution, including unlikelihood training, exponential maximizing average treatment effect (ExMATE), and direct preference optimization (DPO). To systematically compare these methods and further provide a unified recipe for LM optimization, in this paper, we present a unique angle of gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs. Through both mathematical results and experiments on CausalDialogue and Anthropic HH-RLHF datasets, we identify distinct functional characteristics among these methods. We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both the statistical (5-7%) and generative (+18% win rate) performance.
☆ Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model's language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method "Any Model Can Talk". We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.
comment: 10 pages
☆ A GREAT Architecture for Edge-Based Graph Problems Like TSP
In the last years, many neural network-based approaches have been proposed to tackle combinatorial optimization problems such as routing problems. Many of these approaches are based on graph neural networks (GNNs) or related transformers, operating on the Euclidean coordinates representing the routing problems. However, GNNs are inherently not well suited to operate on dense graphs, such as in routing problems. Furthermore, models operating on Euclidean coordinates cannot be applied to non-Euclidean versions of routing problems that are often found in real-world settings. To overcome these limitations, we propose a novel GNN-related edge-based neural model called Graph Edge Attention Network (GREAT). We evaluate the performance of GREAT in the edge-classification task to predict optimal edges in the Traveling Salesman Problem (TSP). We can use such a trained GREAT model to produce sparse TSP graph instances, keeping only the edges GREAT finds promising. Compared to other, non-learning-based methods to sparsify TSP graphs, GREAT can produce very sparse graphs while keeping most of the optimal edges. Furthermore, we build a reinforcement learning-based GREAT framework which we apply to Euclidean and non-Euclidean asymmetric TSP. This framework achieves state-of-the-art results.
comment: 15 pages, 7 figures
☆ Enhanced forecasting of stock prices based on variational mode decomposition, PatchTST, and adaptive scale-weighted layer
The significant fluctuations in stock index prices in recent years highlight the critical need for accurate forecasting to guide investment and financial strategies. This study introduces a novel composite forecasting framework that integrates variational mode decomposition (VMD), PatchTST, and adaptive scale-weighted layer (ASWL) to address these challenges. Utilizing datasets of four major stock indices--SP500, DJI, SSEC, and FTSE--from 2000 to 2024, the proposed method first decomposes the raw price series into intrinsic mode functions (IMFs) using VMD. Each IMF is then modeled with PatchTST to capture temporal patterns effectively. The ASWL module is applied to incorporate scale information, enhancing prediction accuracy. The final forecast is derived by aggregating predictions from all IMFs. The VMD-PatchTST-ASWL framework demonstrates significant improvements in forecasting accuracy compared to traditional models, showing robust performance across different indices. This innovative approach provides a powerful tool for stock index price forecasting, with potential applications in various financial analysis and investment decision-making contexts.
☆ SympGNNs: Symplectic Graph Neural Networks for identifiying high-dimensional Hamiltonian systems and node classification
Existing neural network models to learn Hamiltonian systems, such as SympNets, although accurate in low-dimensions, struggle to learn the correct dynamics for high-dimensional many-body systems. Herein, we introduce Symplectic Graph Neural Networks (SympGNNs) that can effectively handle system identification in high-dimensional Hamiltonian systems, as well as node classification. SympGNNs combines symplectic maps with permutation equivariance, a property of graph neural networks. Specifically, we propose two variants of SympGNNs: i) G-SympGNN and ii) LA-SympGNN, arising from different parameterizations of the kinetic and potential energy. We demonstrate the capabilities of SympGNN on two physical examples: a 40-particle coupled Harmonic oscillator, and a 2000-particle molecular dynamics simulation in a two-dimensional Lennard-Jones potential. Furthermore, we demonstrate the performance of SympGNN in the node classification task, achieving accuracy comparable to the state-of-the-art. We also empirically show that SympGNN can overcome the oversmoothing and heterophily problems, two key challenges in the field of graph neural networks.
comment: 17 pages, 10 figures
☆ CW-CNN & CW-AN: Convolutional Networks and Attention Networks for CW-Complexes
We present a novel framework for learning on CW-complex structured data points. Recent advances have discussed CW-complexes as ideal learning representations for problems in cheminformatics. However, there is a lack of available machine learning methods suitable for learning on CW-complexes. In this paper we develop notions of convolution and attention that are well defined for CW-complexes. These notions enable us to create the first neural network that can receive a CW-complex as input. We illustrate and interpret this framework in the context of supervised prediction.
☆ A Catalog of Fairness-Aware Practices in Machine Learning Engineering
Machine learning's widespread adoption in decision-making processes raises concerns about fairness, particularly regarding the treatment of sensitive features and potential discrimination against minorities. The software engineering community has responded by developing fairness-oriented metrics, empirical studies, and approaches. However, there remains a gap in understanding and categorizing practices for engineering fairness throughout the machine learning lifecycle. This paper presents a novel catalog of practices for addressing fairness in machine learning derived from a systematic mapping study. The study identifies and categorizes 28 practices from existing literature, mapping them onto different stages of the machine learning lifecycle. From this catalog, the authors extract actionable items and implications for both researchers and practitioners in software engineering. This work aims to provide a comprehensive resource for integrating fairness considerations into the development and deployment of machine learning systems, enhancing their reliability, accountability, and credibility.
☆ Entropic Distribution Matching in Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity
Large language models rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks. Cross Entropy (CE) loss is the de facto choice in SFT, but it often leads to overfitting and limited output diversity due to its aggressive updates to the data distribution. This paper aim to address these issues by introducing the maximum entropy principle, which favors models with flatter distributions that still effectively capture the data. Specifically, we develop a new distribution matching method called GEM, which solves reverse Kullback-Leibler divergence minimization with an entropy regularizer. For the SFT of Llama-3-8B models, GEM outperforms CE in several aspects. First, when applied to the UltraFeedback dataset to develop general instruction-following abilities, GEM exhibits reduced overfitting, evidenced by lower perplexity and better performance on the IFEval benchmark. Furthermore, GEM enhances output diversity, leading to performance gains of up to 7 points on math reasoning and code generation tasks using best-of-n sampling, even without domain-specific data. Second, when fine-tuning with domain-specific datasets for math reasoning and code generation, GEM also shows less overfitting and improvements of up to 10 points compared with CE.
☆ Iterative Graph Alignment
By compressing diverse narratives, LLMs go beyond memorization, achieving intelligence by capturing generalizable causal relationships. However, they suffer from local 'representation gaps' due to insufficient training data diversity, limiting their real-world utility, especially in tasks requiring strict alignment to rules. Traditional alignment methods relying on heavy human annotations are inefficient and unscalable. Recent self-alignment techniques also fall short, as they often depend on self-selection based prompting and memorization-based learning. To address these issues, we introduce Iterative Graph Alignment (IGA), an annotation-free rule-based alignment algorithm. A teacher model (VLM) employs Iterative Graph Prompting (IGP) to create logical graphs and reference answers. The student model (LLM) identifies local knowledge gaps by attempting to align its responses with these references, collaborating with helper models to generate diverse answers. These aligned responses are then used for iterative supervised fine-tuning (SFT). Our evaluations across five rule-based scenarios demonstrate IGP's effectiveness, with a 73.12\% alignment improvement in Claude Sonnet 3.5, and Llama3-8B-Instruct achieving an 86.20\% improvement, outperforming Claude Sonnet 3.5 in rule-based alignment.
comment: 12 pages, 4 figures
☆ Optimal Parallelization of Boosting
Recent works on the parallel complexity of Boosting have established strong lower bounds on the tradeoff between the number of training rounds $p$ and the total parallel work per round $t$. These works have also presented highly non-trivial parallel algorithms that shed light on different regions of this tradeoff. Despite these advancements, a significant gap persists between the theoretical lower bounds and the performance of these algorithms across much of the tradeoff space. In this work, we essentially close this gap by providing both improved lower bounds on the parallel complexity of weak-to-strong learners, and a parallel Boosting algorithm whose performance matches these bounds across the entire $p$ vs.~$t$ compromise spectrum, up to logarithmic factors. Ultimately, this work settles the true parallel complexity of Boosting algorithms that are nearly sample-optimal.
☆ Towards Efficient Modelling of String Dynamics: A Comparison of State Space and Koopman based Deep Learning Methods
This paper presents an examination of State Space Models (SSM) and Koopman-based deep learning methods for modelling the dynamics of both linear and non-linear stiff strings. Through experiments with datasets generated under different initial conditions and sample rates, we assess the capacity of these models to accurately model the complex behaviours observed in string dynamics. Our findings indicate that our proposed Koopman-based model performs as well as or better than other existing approaches in non-linear cases for long-sequence modelling. We inform the design of these architectures with the structure of the problems at hand. Although challenges remain in extending model predictions beyond the training horizon (i.e., extrapolation), the focus of our investigation lies in the models' ability to generalise across different initial conditions within the training time interval. This research contributes insights into the physical modelling of dynamical systems (in particular those addressing musical acoustics) by offering a comparative overview of these and previous methods and introducing innovative strategies for model improvement. Our results highlight the efficacy of these models in simulating non-linear dynamics and emphasise their wide-ranging applicability in accurately modelling dynamical systems over extended sequences.
comment: Accepted to DAFx2024
☆ 3D Pose-Based Temporal Action Segmentation for Figure Skating: A Fine-Grained and Jump Procedure-Aware Annotation Approach
Understanding human actions from videos is essential in many domains, including sports. In figure skating, technical judgments are performed by watching skaters' 3D movements, and its part of the judging procedure can be regarded as a Temporal Action Segmentation (TAS) task. TAS tasks in figure skating that automatically assign temporal semantics to video are actively researched. However, there is a lack of datasets and effective methods for TAS tasks requiring 3D pose data. In this study, we first created the FS-Jump3D dataset of complex and dynamic figure skating jumps using optical markerless motion capture. We also propose a new fine-grained figure skating jump TAS dataset annotation method with which TAS models can learn jump procedures. In the experimental results, we validated the usefulness of 3D pose features as input and the fine-grained dataset for the TAS model in figure skating. FS-Jump3D Dataset is available at https://github.com/ryota-skating/FS-Jump3D.
comment: 10 pages, 7th ACM International Workshop on Multimedia Content Analysis in Sports
☆ Turbulence Strength $C_n^2$ Estimation from Video using Physics-based Deep Learning
Images captured from a long distance suffer from dynamic image distortion due to turbulent flow of air cells with random temperatures, and thus refractive indices. This phenomenon, known as image dancing, is commonly characterized by its refractive-index structure constant $C_n^2$ as a measure of the turbulence strength. For many applications such as atmospheric forecast model, long-range/astronomy imaging, and aviation safety, optical communication technology, $C_n^2$ estimation is critical for accurately sensing the turbulent environment. Previous methods for $C_n^2$ estimation include estimation from meteorological data (temperature, relative humidity, wind shear, etc.) for single-point measurements, two-ended pathlength measurements from optical scintillometer for path-averaged $C_n^2$, and more recently estimating $C_n^2$ from passive video cameras for low cost and hardware complexity. In this paper, we present a comparative analysis of classical image gradient methods for $C_n^2$ estimation and modern deep learning-based methods leveraging convolutional neural networks. To enable this, we collect a dataset of video capture along with reference scintillometer measurements for ground truth, and we release this unique dataset to the scientific community. We observe that deep learning methods can achieve higher accuracy when trained on similar data, but suffer from generalization errors to other, unseen imagery as compared to classical methods. To overcome this trade-off, we present a novel physics-based network architecture that combines learned convolutional layers with a differentiable image gradient method that maintains high accuracy while being generalizable across image datasets.
comment: Code Available: https://github.com/Riponcs/Cn2Estimation
☆ Towards Infusing Auxiliary Knowledge for Distracted Driver Detection KDD
Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates the scene graphs, and driver pose information with the visual cues in video frames to create a holistic representation of the driver's actions.Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information.
comment: Accepted at KiL 2024: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference
☆ Hyperdimensional Vector Tsetlin Machines with Applications to Sequence Learning and Generation
We construct a two-layered model for learning and generating sequential data that is both computationally fast and competitive with vanilla Tsetlin machines, adding numerous advantages. Through the use of hyperdimensional vector computing (HVC) algebras and Tsetlin machine clause structures, we demonstrate that the combination of both inherits the generality of data encoding and decoding of HVC with the fast interpretable nature of Tsetlin machines to yield a powerful machine learning model. We apply the approach in two areas, namely in forecasting, generating new sequences, and classification. For the latter, we derive results for the entire UCR Time Series Archive and compare with the standard benchmarks to see how well the method competes in time series classification.
☆ Blending Low and High-Level Semantics of Time Series for Better Masked Time Series Generation
State-of-the-art approaches in time series generation (TSG), such as TimeVQVAE, utilize vector quantization-based tokenization to effectively model complex distributions of time series. These approaches first learn to transform time series into a sequence of discrete latent vectors, and then a prior model is learned to model the sequence. The discrete latent vectors, however, only capture low-level semantics (\textit{e.g.,} shapes). We hypothesize that higher-fidelity time series can be generated by training a prior model on more informative discrete latent vectors that contain both low and high-level semantics (\textit{e.g.,} characteristic dynamics). In this paper, we introduce a novel framework, termed NC-VQVAE, to integrate self-supervised learning into those TSG methods to derive a discrete latent space where low and high-level semantics are captured. Our experimental results demonstrate that NC-VQVAE results in a considerable improvement in the quality of synthetic samples.
☆ Data Quality Monitoring through Transfer Learning on Anomaly Detection for the Hadron Calorimeters
The proliferation of sensors brings an immense volume of spatio-temporal (ST) data in many domains for various purposes, including monitoring, diagnostics, and prognostics applications. Data curation is a time-consuming process for a large volume of data, making it challenging and expensive to deploy data analytics platforms in new environments. Transfer learning (TL) mechanisms promise to mitigate data sparsity and model complexity by utilizing pre-trained models for a new task. Despite the triumph of TL in fields like computer vision and natural language processing, efforts on complex ST models for anomaly detection (AD) applications are limited. In this study, we present the potential of TL within the context of AD for the Hadron Calorimeter of the Compact Muon Solenoid experiment at CERN. We have transferred the ST AD models trained on data collected from one part of a calorimeter to another. We have investigated different configurations of TL on semi-supervised autoencoders of the ST AD models -- transferring convolutional, graph, and recurrent neural networks of both the encoder and decoder networks. The experiment results demonstrate that TL effectively enhances the model learning accuracy on a target subdetector. The TL achieves promising data reconstruction and AD performance while substantially reducing the trainable parameters of the AD models. It also improves robustness against anomaly contamination in the training data sets of the semi-supervised AD models.
comment: 28 pages, 15 figures, and 9 tables
☆ Subspace Representation Learning for Sparse Linear Arrays to Localize More Sources than Sensors: A Deep Learning Methodology
Localizing more sources than sensors with a sparse linear array (SLA) has long relied on minimizing a distance between two covariance matrices and recent algorithms often utilize semidefinite programming (SDP). Although deep neural network (DNN)-based methods offer new alternatives, they still depend on covariance matrix fitting. In this paper, we develop a novel methodology that estimates the co-array subspaces from a sample covariance for SLAs. Our methodology trains a DNN to learn signal and noise subspace representations that are invariant to the selection of bases. To learn such representations, we propose loss functions that gauge the separation between the desired and the estimated subspace. In particular, we propose losses that measure the length of the shortest path between subspaces viewed on a union of Grassmannians, and prove that it is possible for a DNN to approximate signal subspaces. The computation of learning subspaces of different dimensions is accelerated by a new batch sampling strategy called consistent rank sampling. The methodology is robust to array imperfections due to its geometry-agnostic and data-driven nature. In addition, we propose a fully end-to-end gridless approach that directly learns angles to study the possibility of bypassing subspace methods. Numerical results show that learning such subspace representations is more beneficial than learning covariances or angles. It outperforms conventional SDP-based methods such as the sparse and parametric approach (SPA) and existing DNN-based covariance reconstruction methods for a wide range of signal-to-noise ratios (SNRs), snapshots, and source numbers for both perfect and imperfect arrays.
comment: 13 pages. Submitted to the IEEE Transactions on Signal Processing
☆ sEMG-Driven Physics-Informed Gated Recurrent Networks for Modeling Upper Limb Multi-Joint Movement Dynamics
Exoskeletons and rehabilitation systems offer great potential for enhancing human strength and recovery through advanced human-machine interfaces (HMIs) that adapt to movement dynamics. However, the real-time application of physics-informed neural networks (PINNs) is limited by their reliance on fixed input lengths and surrogate models. This study introduces a novel physics-informed Gated Recurrent Network (PiGRN) designed to predict multi-joint torques using surface electromyography (sEMG) data. The PiGRN model employs a Gated Recurrent Unit (GRU) to convert time-series sEMG inputs into multi-joint kinematics and external loads, which are then integrated into an equation of motion to ensure consistency with physical laws. Experimental validation with sEMG data from five participants performing elbow flexion-extension tasks showed that the PiGRN model accurately predicted joint torques for 10 unfamiliar movements, with RMSE values between 4.02\% and 11.40\% and correlation coefficients ranging from 0.87 to 0.98. These findings highlight the PiGRN's potential for real-time exoskeleton and rehabilitation applications. Future research will explore more diverse datasets, improve musculoskeletal models, and investigate unsupervised learning methods.
☆ High-Dimensional Sparse Data Low-rank Representation via Accelerated Asynchronous Parallel Stochastic Gradient Descent
Data characterized by high dimensionality and sparsity are commonly used to describe real-world node interactions. Low-rank representation (LR) can map high-dimensional sparse (HDS) data to low-dimensional feature spaces and infer node interactions via modeling data latent associations. Unfortunately, existing optimization algorithms for LR models are computationally inefficient and slowly convergent on large-scale datasets. To address this issue, this paper proposes an Accelerated Asynchronous Parallel Stochastic Gradient Descent A2PSGD for High-Dimensional Sparse Data Low-rank Representation with three fold-ideas: a) establishing a lock-free scheduler to simultaneously respond to scheduling requests from multiple threads; b) introducing a greedy algorithm-based load balancing strategy for balancing the computational load among threads; c) incorporating Nesterov's accelerated gradient into the learning scheme to accelerate model convergence. Empirical studies show that A2PSGD outperforms existing optimization algorithms for HDS data LR in both accuracy and training time.
☆ CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions INTERSPEECH2024
We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is available open https://github.com/nyrahealth/CrisperWhisper.
comment: Published at INTERSPEECH2024
Transformers Meet ACT-R: Repeat-Aware and Sequential Listening Session Recommendation RecSys'2024
Music streaming services often leverage sequential recommender systems to predict the best music to showcase to users based on past sequences of listening sessions. Nonetheless, most sequential recommendation methods ignore or insufficiently account for repetitive behaviors. This is a crucial limitation for music recommendation, as repeatedly listening to the same song over time is a common phenomenon that can even change the way users perceive this song. In this paper, we introduce PISA (Psychology-Informed Session embedding using ACT-R), a session-level sequential recommender system that overcomes this limitation. PISA employs a Transformer architecture learning embedding representations of listening sessions and users using attention mechanisms inspired by Anderson's ACT-R (Adaptive Control of Thought-Rational), a cognitive architecture modeling human information access and memory dynamics. This approach enables us to capture dynamic and repetitive patterns from user behaviors, allowing us to effectively predict the songs they will listen to in subsequent sessions, whether they are repeated or new ones. We demonstrate the empirical relevance of PISA using both publicly available listening data from Last.fm and proprietary data from Deezer, a global music streaming service, confirming the critical importance of repetition modeling for sequential listening session recommendation. Along with this paper, we publicly release our proprietary dataset to foster future research in this field, as well as the source code of PISA to facilitate its future use.
comment: 11 pages. Accepted by RecSys'2024, full paper
☆ Seeking the Sufficiency and Necessity Causal Features in Multimodal Representation Learning
Learning representations with a high Probability of Necessary and Sufficient Causes (PNS) has been shown to enhance deep learning models' ability. This task involves identifying causal features that are both sufficient (guaranteeing the outcome) and necessary (without which the outcome cannot occur). However, current research predominantly focuses on unimodal data, and extending PNS learning to multimodal settings presents significant challenges. The challenges arise as the conditions for PNS identifiability, Exogeneity and Monotonicity, need to be reconsidered in a multimodal context, where sufficient and necessary causal features are distributed across different modalities. To address this, we first propose conceptualizing multimodal representations as comprising modality-invariant and modality-specific components. We then analyze PNS identifiability for each component, while ensuring non-trivial PNS estimation. Finally, we formulate tractable optimization objectives that enable multimodal models to learn high-PNS representations, thereby enhancing their predictive performance. Experiments demonstrate the effectiveness of our method on both synthetic and real-world data.
☆ An Adaptive Latent Factorization of Tensors Model for Embedding Dynamic Communication Network
The Dynamic Communication Network (DCN) describes the interactions over time among various communication nodes, and it is widely used in Big-data applications as a data source. As the number of communication nodes increases and temporal slots accumulate, each node interacts in with only a few nodes in a given temporal slot, the DCN can be represented by an High-Dimensional Sparse (HDS) tensor. In order to extract rich behavioral patterns from an HDS tensor in DCN, this paper proposes an Adaptive Temporal-dependent Tensor low-rank representation (ATT) model. It adopts a three-fold approach: a) designing a temporal-dependent method to reconstruct temporal feature matrix, thereby precisely represent the data by capturing the temporal patterns; b) achieving hyper-parameters adaptation of the model via the Differential Evolutionary Algorithms (DEA) to avoid tedious hyper-parameters tuning; c) employing nonnegative learning schemes for the model parameters to effectively handle an the nonnegativity inherent in HDS data. The experimental results on four real-world DCNs demonstrate that the proposed ATT model significantly outperforms several state-of-the-art models in both prediction errors and convergence rounds.
comment: 10 pages, 2 figures
☆ Identifying Terrain Physical Parameters from Vision -- Towards Physical-Parameter-Aware Locomotion and Navigation
Identifying the physical properties of the surrounding environment is essential for robotic locomotion and navigation to deal with non-geometric hazards, such as slippery and deformable terrains. It would be of great benefit for robots to anticipate these extreme physical properties before contact; however, estimating environmental physical parameters from vision is still an open challenge. Animals can achieve this by using their prior experience and knowledge of what they have seen and how it felt. In this work, we propose a cross-modal self-supervised learning framework for vision-based environmental physical parameter estimation, which paves the way for future physical-property-aware locomotion and navigation. We bridge the gap between existing policies trained in simulation and identification of physical terrain parameters from vision. We propose to train a physical decoder in simulation to predict friction and stiffness from multi-modal input. The trained network allows the labeling of real-world images with physical parameters in a self-supervised manner to further train a visual network during deployment, which can densely predict the friction and stiffness from image data. We validate our physical decoder in simulation and the real world using a quadruped ANYmal robot, outperforming an existing baseline method. We show that our visual network can predict the physical properties in indoor and outdoor experiments while allowing fast adaptation to new environments.
☆ Android Malware Detection Based on RGB Images and Multi-feature Fusion
With the widespread adoption of smartphones, Android malware has become a significant challenge in the field of mobile device security. Current Android malware detection methods often rely on feature engineering to construct dynamic or static features, which are then used for learning. However, static feature-based methods struggle to counter code obfuscation, packing, and signing techniques, while dynamic feature-based methods involve time-consuming feature extraction. Image-based methods for Android malware detection offer better resilience against malware variants and polymorphic malware. This paper proposes an end-to-end Android malware detection technique based on RGB images and multi-feature fusion. The approach involves extracting Dalvik Executable (DEX) files, AndroidManifest.xml files, and API calls from APK files, converting them into grayscale images, and enhancing their texture features using Canny edge detection, histogram equalization, and adaptive thresholding techniques. These grayscale images are then combined into an RGB image containing multi-feature fusion information, which is analyzed using mainstream image classification models for Android malware detection. Extensive experiments demonstrate that the proposed method effectively captures Android malware characteristics, achieving an accuracy of up to 97.25%, outperforming existing detection methods that rely solely on DEX files as classification features. Additionally, ablation experiments confirm the effectiveness of using the three key files for feature representation in the proposed approach.
comment: 9 pages,10 figures
☆ Super-Resolution works for coastal simulations
Learning fine-scale details of a coastal ocean simulation from a coarse representation is a challenging task. For real-world applications, high-resolution simulations are necessary to advance understanding of many coastal processes, specifically, to predict flooding resulting from tsunamis and storm surges. We propose a Deep Network for Coastal Super-Resolution (DNCSR) for spatiotemporal enhancement to efficiently learn the high-resolution numerical solution. Given images of coastal simulations produced on low-resolution computational meshes using low polynomial order discontinuous Galerkin discretizations and a coarse temporal resolution, the proposed DNCSR learns to produce high-resolution free surface elevation and velocity visualizations in both time and space. To efficiently model the dynamic changes over time and space, we propose grid-aware spatiotemporal attention to project the temporal features to the spatial domain for non-local feature matching. The coordinate information is also utilized via positional encoding. For the final reconstruction, we use the spatiotemporal bilinear operation to interpolate the missing frames and then expand the feature maps to the frequency domain for residual mapping. Besides data-driven losses, the proposed physics-informed loss guarantees gradient consistency and momentum changes. Their combination contributes to the overall 24% improvements in RMSE. To train the proposed model, we propose a large-scale coastal simulation dataset and use it for model optimization and evaluation. Our method shows superior super-resolution quality and fast computation compared to the state-of-the-art methods.
comment: 13 pages, 12 figures
☆ Statistical and Geometrical properties of regularized Kernel Kullback-Leibler divergence
In this paper, we study the statistical and geometrical properties of the Kullback-Leibler divergence with kernel covariance operators (KKL) introduced by Bach [2022]. Unlike the classical Kullback-Leibler (KL) divergence that involves density ratios, the KKL compares probability distributions through covariance operators (embeddings) in a reproducible kernel Hilbert space (RKHS), and compute the Kullback-Leibler quantum divergence. This novel divergence hence shares parallel but different aspects with both the standard Kullback-Leibler between probability distributions and kernel embeddings metrics such as the maximum mean discrepancy. A limitation faced with the original KKL divergence is its inability to be defined for distributions with disjoint supports. To solve this problem, we propose in this paper a regularised variant that guarantees that the divergence is well defined for all distributions. We derive bounds that quantify the deviation of the regularised KKL to the original one, as well as finite-sample bounds. In addition, we provide a closed-form expression for the regularised KKL, specifically applicable when the distributions consist of finite sets of points, which makes it implementable. Furthermore, we derive a Wasserstein gradient descent scheme of the KKL divergence in the case of discrete distributions, and study empirically its properties to transport a set of points to a target distribution.
☆ SALSA: Speedy ASR-LLM Synchronous Aggregation INTERSPEECH 2024
Harnessing pre-trained LLMs to improve ASR systems, particularly for low-resource languages, is now an emerging area of research. Existing methods range from using LLMs for ASR error correction to tightly coupled systems that replace the ASR decoder with the LLM. These approaches either increase decoding time or require expensive training of the cross-attention layers. We propose SALSA, which couples the decoder layers of the ASR to the LLM decoder, while synchronously advancing both decoders. Such coupling is performed with a simple projection of the last decoder state, and is thus significantly more training efficient than earlier approaches. A challenge of our proposed coupling is handling the mismatch between the tokenizers of the LLM and ASR systems. We handle this mismatch using cascading tokenization with respect to the LLM and ASR vocabularies. We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.
comment: Accepted to INTERSPEECH 2024
☆ SFR-GNN: Simple and Fast Robust GNNs against Structural Attacks
Graph Neural Networks (GNNs) have demonstrated commendable performance for graph-structured data. Yet, GNNs are often vulnerable to adversarial structural attacks as embedding generation relies on graph topology. Existing efforts are dedicated to purifying the maliciously modified structure or applying adaptive aggregation, thereby enhancing the robustness against adversarial structural attacks. It is inevitable for a defender to consume heavy computational costs due to lacking prior knowledge about modified structures. To this end, we propose an efficient defense method, called Simple and Fast Robust Graph Neural Network (SFR-GNN), supported by mutual information theory. The SFR-GNN first pre-trains a GNN model using node attributes and then fine-tunes it over the modified graph in the manner of contrastive learning, which is free of purifying modified structures and adaptive aggregation, thus achieving great efficiency gains. Consequently, SFR-GNN exhibits a 24%--162% speedup compared to advanced robust models, demonstrating superior robustness for node classification tasks.
☆ TinyTNAS: GPU-Free, Time-Bound, Hardware-Aware Neural Architecture Search for TinyML Time Series Classification
In this work, we present TinyTNAS, a novel hardware-aware multi-objective Neural Architecture Search (NAS) tool specifically designed for TinyML time series classification. Unlike traditional NAS methods that rely on GPU capabilities, TinyTNAS operates efficiently on CPUs, making it accessible for a broader range of applications. Users can define constraints on RAM, FLASH, and MAC operations to discover optimal neural network architectures within these parameters. Additionally, the tool allows for time-bound searches, ensuring the best possible model is found within a user-specified duration. By experimenting with benchmark dataset UCI HAR, PAMAP2, WISDM, MIT BIH, and PTB Diagnostic ECG Databas TinyTNAS demonstrates state-of-the-art accuracy with significant reductions in RAM, FLASH, MAC usage, and latency. For example, on the UCI HAR dataset, TinyTNAS achieves a 12x reduction in RAM usage, a 144x reduction in MAC operations, and a 78x reduction in FLASH memory while maintaining superior accuracy and reducing latency by 149x. Similarly, on the PAMAP2 and WISDM datasets, it achieves a 6x reduction in RAM usage, a 40x reduction in MAC operations, an 83x reduction in FLASH, and a 67x reduction in latency, all while maintaining superior accuracy. Notably, the search process completes within 10 minutes in a CPU environment. These results highlight TinyTNAS's capability to optimize neural network architectures effectively for resource-constrained TinyML applications, ensuring both efficiency and high performance. The code for TinyTNAS is available at the GitHub repository and can be accessed at https://github.com/BidyutSaha/TinyTNAS.git.
☆ WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1)extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one-second audio of 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2)improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.
comment: Working in progress. arXiv admin note: text overlap with arXiv:2402.12208
☆ Multitask learning for improved scour detection: A dynamic wave tank study
Population-based structural health monitoring (PBSHM), aims to share information between members of a population. An offshore wind (OW) farm could be considered as a population of nominally-identical wind-turbine structures. However, benign variations exist among members, such as geometry, sea-bed conditions and temperature differences. These factors could influence structural properties and therefore the dynamic response, making it more difficult to detect structural problems via traditional SHM techniques. This paper explores the use of a Bayesian hierarchical model as a means of multitask learning, to infer foundation stiffness distribution parameters at both population and local levels. To do this, observations of natural frequency from populations of structures were first generated from both numerical and experimental models. These observations were then used in a partially-pooled Bayesian hierarchical model in tandem with surrogate FE models of the structures to infer foundation stiffness parameters. Finally, it is demonstrated how the learned parameters may be used as a basis to perform more robust anomaly detection (as compared to a no-pooling approach) e.g. as a result of scour.
comment: 25 pages, 12 figures, early work features in ISWHM 2023 conference proceedings and available here: arXiv:2402.19295. Submitted to the Renewable Energy journal
☆ Adaptive Variational Continual Learning via Task-Heuristic Modelling
Variational continual learning (VCL) is a turn-key learning algorithm that has state-of-the-art performance among the best continual learning models. In our work, we explore an extension of the generalized variational continual learning (GVCL) model, named AutoVCL, which combines task heuristics for informed learning and model optimization. We demonstrate that our model outperforms the standard GVCL with fixed hyperparameters, benefiting from the automatic adjustment of the hyperparameter based on the difficulty and similarity of the incoming task compared to the previous tasks.
comment: 4 pages, 2 figures, 3 tables
☆ On-device AI: Quantization-aware Training of Transformers in Time-Series
Artificial Intelligence (AI) models for time-series in pervasive computing keep getting larger and more complicated. The Transformer model is by far the most compelling of these AI models. However, it is difficult to obtain the desired performance when deploying such a massive model on a sensor device with limited resources. My research focuses on optimizing the Transformer model for time-series forecasting tasks. The optimized model will be deployed as hardware accelerators on embedded Field Programmable Gate Arrays (FPGAs). I will investigate the impact of applying Quantization-aware Training to the Transformer model to reduce its size and runtime memory footprint while maximizing the advantages of FPGAs.
comment: This paper is accepted by 2023 IEEE International Conference on Pervasive Computing and Communications(PhD Forum)
☆ An Exploratory Deep Learning Approach for Predicting Subsequent Suicidal Acts in Chinese Psychological Support Hotlines
Psychological support hotlines are an effective suicide prevention measure that typically relies on professionals using suicide risk assessment scales to predict individual risk scores. However, the accuracy of scale-based predictive methods for suicide risk assessment can vary widely depending on the expertise of the operator. This limitation underscores the need for more reliable methods, prompting this research's innovative exploration of the use of artificial intelligence to improve the accuracy and efficiency of suicide risk prediction within the context of psychological support hotlines. The study included data from 1,549 subjects from 2015-2017 in China who contacted a psychological support hotline. Each participant was followed for 12 months to identify instances of suicidal behavior. We proposed a novel multi-task learning method that uses the large-scale pre-trained model Whisper for feature extraction and fits psychological scales while predicting the risk of suicide. The proposed method yields a 2.4\% points improvement in F1-score compared to the traditional manual approach based on the psychological scales. Our model demonstrated superior performance compared to the other eight popular models. To our knowledge, this study is the first to apply deep learning to long-term speech data to predict suicide risk in China, indicating grate potential for clinical applications. The source code is publicly available at: \url{https://github.com/songchangwei/Suicide-Risk-Prediction}.
☆ HYGENE: A Diffusion-based Hypergraph Generation Method
Hypergraphs are powerful mathematical structures that can model complex, high-order relationships in various domains, including social networks, bioinformatics, and recommender systems. However, generating realistic and diverse hypergraphs remains challenging due to their inherent complexity and lack of effective generative models. In this paper, we introduce a diffusion-based Hypergraph Generation (HYGENE) method that addresses these challenges through a progressive local expansion approach. HYGENE works on the bipartite representation of hypergraphs, starting with a single pair of connected nodes and iteratively expanding it to form the target hypergraph. At each step, nodes and hyperedges are added in a localized manner using a denoising diffusion process, which allows for the construction of the global structure before refining local details. Our experiments demonstrated the effectiveness of HYGENE, proving its ability to closely mimic a variety of properties in hypergraphs. To the best of our knowledge, this is the first attempt to employ deep learning models for hypergraph generation, and our work aims to lay the groundwork for future research in this area.
comment: arXiv admin note: text overlap with arXiv:2312.11529 by other authors
☆ Do Recommender Systems Promote Local Music? A Reproducibility Study Using Music Streaming Data
This paper examines the influence of recommender systems on local music representation, discussing prior findings from an empirical study on the LFM-2b public dataset. This prior study argued that different recommender systems exhibit algorithmic biases shifting music consumption either towards or against local content. However, LFM-2b users do not reflect the diverse audience of music streaming services. To assess the robustness of this study's conclusions, we conduct a comparative analysis using proprietary listening data from a global music streaming service, which we publicly release alongside this paper. We observe significant differences in local music consumption patterns between our dataset and LFM-2b, suggesting that caution should be exercised when drawing conclusions on local music based solely on LFM-2b. Moreover, we show that the algorithmic biases exhibited in the original work vary in our dataset, and that several unexplored model parameters can significantly influence these biases and affect the study's conclusion on both datasets. Finally, we discuss the complexity of accurately labeling local music, emphasizing the risk of misleading conclusions due to unreliable, biased, or incomplete labels. To encourage further research and ensure reproducibility, we have publicly shared our dataset and code.
☆ Gradient-free variational learning with conditional mixture networks
Balancing computational efficiency with robust predictive performance is crucial in supervised learning, especially for critical applications. Standard deep learning models, while accurate and scalable, often lack probabilistic features like calibrated predictions and uncertainty quantification. Bayesian methods address these issues but can be computationally expensive as model and data complexity increase. Previous work shows that fast variational methods can reduce the compute requirements of Bayesian methods by eliminating the need for gradient computation or sampling, but are often limited to simple models. We demonstrate that conditional mixture networks (CMNs), a probabilistic variant of the mixture-of-experts (MoE) model, are suitable for fast, gradient-free inference and can solve complex classification tasks. CMNs employ linear experts and a softmax gating network. By exploiting conditional conjugacy and P\'olya-Gamma augmentation, we furnish Gaussian likelihoods for the weights of both the linear experts and the gating network. This enables efficient variational updates using coordinate ascent variational inference (CAVI), avoiding traditional gradient-based optimization. We validate this approach by training two-layer CMNs on standard benchmarks from the UCI repository. Our method, CAVI-CMN, achieves competitive and often superior predictive accuracy compared to maximum likelihood estimation (MLE) with backpropagation, while maintaining competitive runtime and full posterior distributions over all model parameters. Moreover, as input size or the number of experts increases, computation time scales competitively with MLE and other gradient-based solutions like black-box variational inference (BBVI), making CAVI-CMN a promising tool for deep, fast, and gradient-free Bayesian networks.
comment: 16 pages main text (3 figures), including references. 9 pages supplementary material (5 figures)
☆ A Comparative Study of Hyperparameter Tuning Methods
The study emphasizes the challenge of finding the optimal trade-off between bias and variance, especially as hyperparameter optimization increases in complexity. Through empirical analysis, three hyperparameter tuning algorithms Tree-structured Parzen Estimator (TPE), Genetic Search, and Random Search are evaluated across regression and classification tasks. The results show that nonlinear models, with properly tuned hyperparameters, significantly outperform linear models. Interestingly, Random Search excelled in regression tasks, while TPE was more effective for classification tasks. This suggests that there is no one-size-fits-all solution, as different algorithms perform better depending on the task and model type. The findings underscore the importance of selecting the appropriate tuning method and highlight the computational challenges involved in optimizing machine learning models, particularly as search spaces expand.
comment: This chapter has been accepted in the edited volume titles "Data Science in Theory and Practice", editor J Sen & S Roy Choudhury. The volume is expected to be published in October 2024 by Cambridge Scholars Publishing, New Castle upon Tyne, UK. This chapter is 34 pages long and it contains 11 tables and 8 images
☆ Fourier Spectral Physics Informed Neural Network: An Efficient and Low-Memory PINN
With growing investigations into solving partial differential equations by physics-informed neural networks (PINNs), more accurate and efficient PINNs are required to meet the practical demands of scientific computing. One bottleneck of current PINNs is computing the high-order derivatives via automatic differentiation which often necessitates substantial computing resources. In this paper, we focus on removing the automatic differentiation of the spatial derivatives and propose a spectral-based neural network that substitutes the differential operator with a multiplication. Compared to the PINNs, our approach requires lower memory and shorter training time. Thanks to the exponential convergence of the spectral basis, our approach is more accurate. Moreover, to handle the different situations between physics domain and spectral domain, we provide two strategies to train networks by their spectral information. Through a series of comprehensive experiments, We validate the aforementioned merits of our proposed network.
☆ DeepSPoC: A Deep Learning-Based PDE Solver Governed by Sequential Propagation of Chaos
Sequential propagation of chaos (SPoC) is a recently developed tool to solve mean-field stochastic differential equations and their related nonlinear Fokker-Planck equations. Based on the theory of SPoC, we present a new method (deepSPoC) that combines the interacting particle system of SPoC and deep learning. Under the framework of deepSPoC, two classes of frequently used deep models include fully connected neural networks and normalizing flows are considered. For high-dimensional problems, spatial adaptive method are designed to further improve the accuracy and efficiency of deepSPoC. We analysis the convergence of the framework of deepSPoC under some simplified conditions and also provide a posterior error estimation for the algorithm. Finally, we test our methods on a wide range of different types of mean-field equations.
☆ Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization
In real-world applications, users often favor structurally diverse design choices over one high-quality solution. It is hence important to consider more solutions that decision-makers can compare and further explore based on additional criteria. Alongside the existing approaches of evolutionary diversity optimization, quality diversity, and multimodal optimization, this paper presents a fresh perspective on this challenge by considering the problem of identifying a fixed number of solutions with a pairwise distance above a specified threshold while maximizing their average quality. We obtain first insight into these objectives by performing a subset selection on the search trajectories of different well-established search heuristics, whether specifically designed with diversity in mind or not. We emphasize that the main goal of our work is not to present a new algorithm but to look at the problem in a more fundamental and theoretically tractable way by asking the question: What trade-off exists between the minimum distance within batches of solutions and the average quality of their fitness? These insights also provide us with a way of making general claims concerning the properties of optimization problems that shall be useful in turn for benchmarking algorithms of the approaches enumerated above. A possibly surprising outcome of our empirical study is the observation that naive uniform random sampling establishes a very strong baseline for our problem, hardly ever outperformed by the search trajectories of the considered heuristics. We interpret these results as a motivation to develop algorithms tailored to produce diverse solutions of high average quality.
☆ TempoKGAT: A Novel Graph Attention Network Approach for Temporal Graph Analysis
Graph neural networks (GNN) have shown significant capabilities in handling structured data, yet their application to dynamic, temporal data remains limited. This paper presents a new type of graph attention network, called TempoKGAT, which combines time-decaying weight and a selective neighbor aggregation mechanism on the spatial domain, which helps uncover latent patterns in the graph data. In this approach, a top-k neighbor selection based on the edge weights is introduced to represent the evolving features of the graph data. We evaluated the performance of our TempoKGAT on multiple datasets from the traffic, energy, and health sectors involving spatio-temporal data. We compared the performance of our approach to several state-of-the-art methods found in the literature on several open-source datasets. Our method shows superior accuracy on all datasets. These results indicate that TempoKGAT builds on existing methodologies to optimize prediction accuracy and provide new insights into model interpretation in temporal contexts.
☆ Addressing Common Misinterpretations of KART and UAT in Neural Network Literature
This note addresses the Kolmogorov-Arnold Representation Theorem (KART) and the Universal Approximation Theorem (UAT), focusing on their common misinterpretations in some papers related to neural network approximation. Our remarks aim to support a more accurate understanding of KART and UAT among neural network specialists.
comment: 6 pages
☆ TG-PhyNN: An Enhanced Physically-Aware Graph Neural Network framework for forecasting Spatio-Temporal Data
Accurately forecasting dynamic processes on graphs, such as traffic flow or disease spread, remains a challenge. While Graph Neural Networks (GNNs) excel at modeling and forecasting spatio-temporal data, they often lack the ability to directly incorporate underlying physical laws. This work presents TG-PhyNN, a novel Temporal Graph Physics-Informed Neural Network framework. TG-PhyNN leverages the power of GNNs for graph-based modeling while simultaneously incorporating physical constraints as a guiding principle during training. This is achieved through a two-step prediction strategy that enables the calculation of physical equation derivatives within the GNN architecture. Our findings demonstrate that TG-PhyNN significantly outperforms traditional forecasting models (e.g., GRU, LSTM, GAT) on real-world spatio-temporal datasets like PedalMe (traffic flow), COVID-19 spread, and Chickenpox outbreaks. These datasets are all governed by well-defined physical principles, which TG-PhyNN effectively exploits to offer more reliable and accurate forecasts in various domains where physical processes govern the dynamics of data. This paves the way for improved forecasting in areas like traffic flow prediction, disease outbreak prediction, and potentially other fields where physics plays a crucial role.
☆ Machine learning models for daily rainfall forecasting in Northern Tropical Africa using tropical wave predictors
Numerical weather prediction (NWP) models often underperform compared to simpler climatology-based precipitation forecasts in northern tropical Africa, even after statistical postprocessing. AI-based forecasting models show promise but have avoided precipitation due to its complexity. Synoptic-scale forcings like African easterly waves and other tropical waves (TWs) are important for predictability in tropical Africa, yet their value for predicting daily rainfall remains unexplored. This study uses two machine-learning models--gamma regression and a convolutional neural network (CNN)--trained on TW predictors from satellite-based GPM IMERG data to predict daily rainfall during the July-September monsoon season. Predictor variables are derived from the local amplitude and phase information of seven TW from the target and up-and-downstream neighboring grids at 1-degree spatial resolution. The ML models are combined with Easy Uncertainty Quantification (EasyUQ) to generate calibrated probabilistic forecasts and are compared with three benchmarks: Extended Probabilistic Climatology (EPC15), ECMWF operational ensemble forecast (ENS), and a probabilistic forecast from the ENS control member using EasyUQ (CTRL EasyUQ). The study finds that downstream predictor variables offer the highest predictability, with downstream tropical depression (TD)-type wave-based predictors being most important. Other waves like mixed-Rossby gravity (MRG), Kelvin, and inertio-gravity waves also contribute significantly but show regional preferences. ENS forecasts exhibit poor skill due to miscalibration. CTRL EasyUQ shows improvement over ENS and marginal enhancement over EPC15. Both gamma regression and CNN forecasts significantly outperform benchmarks in tropical Africa. This study highlights the potential of ML models trained on TW-based predictors to improve daily precipitation forecasts in tropical Africa.
☆ Do Graph Neural Networks Work for High Entropy Alloys?
Graph neural networks (GNNs) have excelled in predictive modeling for both crystals and molecules, owing to the expressiveness of graph representations. High-entropy alloys (HEAs), however, lack chemical long-range order, limiting the applicability of current graph representations. To overcome this challenge, we propose a representation of HEAs as a collection of local environment (LE) graphs. Based on this representation, we introduce the LESets machine learning model, an accurate, interpretable GNN for HEA property prediction. We demonstrate the accuracy of LESets in modeling the mechanical properties of quaternary HEAs. Through analyses and interpretation, we further extract insights into the modeling and design of HEAs. In a broader sense, LESets extends the potential applicability of GNNs to disordered materials with combinatorial complexity formed by diverse constituents and their flexible configurations.
☆ GL-TSVM: A robust and smooth twin support vector machine with guardian loss function
Twin support vector machine (TSVM), a variant of support vector machine (SVM), has garnered significant attention due to its $3/4$ times lower computational complexity compared to SVM. However, due to the utilization of the hinge loss function, TSVM is sensitive to outliers or noise. To remedy it, we introduce the guardian loss (G-loss), a novel loss function distinguished by its asymmetric, bounded, and smooth characteristics. We then fuse the proposed G-loss function into the TSVM and yield a robust and smooth classifier termed GL-TSVM. Further, to adhere to the structural risk minimization (SRM) principle and reduce overfitting, we incorporate a regularization term into the objective function of GL-TSVM. To address the optimization challenges of GL-TSVM, we devise an efficient iterative algorithm. The experimental analysis on UCI and KEEL datasets substantiates the effectiveness of the proposed GL-TSVM in comparison to the baseline models. Moreover, to showcase the efficacy of the proposed GL-TSVM in the biomedical domain, we evaluated it on the breast cancer (BreaKHis) and schizophrenia datasets. The outcomes strongly demonstrate the competitiveness of the proposed GL-TSVM against the baseline models.
comment: arXiv admin note: text overlap with arXiv:2404.18101
☆ Self-Improving Diffusion Models with Synthetic Data
The artificial intelligence (AI) world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends that synthetic data is to be avoided for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model's generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fr\'echet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model's synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.
☆ Minimising changes to audit when updating decision trees
Interpretable models are important, but what happens when the model is updated on new training data? We propose an algorithm for updating a decision tree while minimising the number of changes to the tree that a human would need to audit. We achieve this via a greedy approach that incorporates the number of changes to the tree as part of the objective function. We compare our algorithm to existing methods and show that it sits in a sweet spot between final accuracy and number of changes to audit.
comment: 12 pages
☆ Passenger hazard perception based on EEG signals for highly automated driving vehicles
Enhancing the safety of autonomous vehicles is crucial, especially given recent accidents involving automated systems. As passengers in these vehicles, humans' sensory perception and decision-making can be integrated with autonomous systems to improve safety. This study explores neural mechanisms in passenger-vehicle interactions, leading to the development of a Passenger Cognitive Model (PCM) and the Passenger EEG Decoding Strategy (PEDS). Central to PEDS is a novel Convolutional Recurrent Neural Network (CRNN) that captures spatial and temporal EEG data patterns. The CRNN, combined with stacking algorithms, achieves an accuracy of $85.0\% \pm 3.18\%$. Our findings highlight the predictive power of pre-event EEG data, enhancing the detection of hazardous scenarios and offering a network-driven framework for safer autonomous vehicles.
☆ Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to "self-correct" their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others.
comment: arXiv admin note: text overlap with arXiv:2407.20311
☆ Flexible framework for generating synthetic electrocardiograms and photoplethysmograms
By generating synthetic biosignals, the quantity and variety of health data can be increased. This is especially useful when training machine learning models by enabling data augmentation and introduction of more physiologically plausible variation to the data. For these purposes, we have developed a synthetic biosignal model for two signal modalities, electrocardiography (ECG) and photoplethysmography (PPG). The model produces realistic signals that account for physiological effects such as breathing modulation and changes in heart rate due to physical stress. Arrhythmic signals can be generated with beat intervals extracted from real measurements. The model also includes a flexible approach to adding different kinds of noise and signal artifacts. The noise is generated from power spectral densities extracted from both measured noisy signals and modeled power spectra. Importantly, the model also automatically produces labels for noise, segmentation (e.g. P and T waves, QRS complex, for electrocardiograms), and artifacts. We assessed how this comprehensive model can be used in practice to improve the performance of models trained on ECG or PPG data. For example, we trained an LSTM to detect ECG R-peaks using both real ECG signals from the MIT-BIH arrythmia set and our new generator. The F1 score of the model was 0.83 using real data, in comparison to 0.98 using our generator. In addition, the model can be used for example in signal segmentation, quality detection and bench-marking detection algorithms. The model code has been released in \url{https://github.com/UTU-Health-Research/framework_for_synthetic_biosignals}
☆ OpenFGL: A Comprehensive Benchmarks for Federated Graph Learning
Federated graph learning (FGL) has emerged as a promising distributed training paradigm for graph neural networks across multiple local systems without direct data sharing. This approach is particularly beneficial in privacy-sensitive scenarios and offers a new perspective on addressing scalability challenges in large-scale graph learning. Despite the proliferation of FGL, the diverse motivations from practical applications, spanning various research backgrounds and experimental settings, pose a significant challenge to fair evaluation. To fill this gap, we propose OpenFGL, a unified benchmark designed for the primary FGL scenarios: Graph-FL and Subgraph-FL. Specifically, OpenFGL includes 38 graph datasets from 16 application domains, 8 federated data simulation strategies that emphasize graph properties, and 5 graph-based downstream tasks. Additionally, it offers 18 recently proposed SOTA FGL algorithms through a user-friendly API, enabling a thorough comparison and comprehensive evaluation of their effectiveness, robustness, and efficiency. Empirical results demonstrate the ability of FGL while also revealing its potential limitations, offering valuable insights for future exploration in this thriving field.
comment: Under Review
☆ Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form
Designing a safe policy for uncertain environments is crucial in real-world control applications. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm capable of identifying a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional Lagrangian max-min formulation with policy gradient methods can become trapped in suboptimal solutions by encountering a sum of conflicting gradients from the objective and constraint functions during its inner minimization problem. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a binary search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\tilde{\mathcal{O}}(\varepsilon^{-4})$ policy evaluations.
☆ ART: Actually Robust Training
Current interest in deep learning captures the attention of many programmers and researchers. Unfortunately, the lack of a unified schema for developing deep learning models results in methodological inconsistencies, unclear documentation, and problems with reproducibility. Some guidelines have been proposed, yet currently, they lack practical implementations. Furthermore, neural network training often takes on the form of trial and error, lacking a structured and thoughtful process. To alleviate these issues, in this paper, we introduce Art, a Python library designed to help automatically impose rules and standards while developing deep learning pipelines. Art divides model development into a series of smaller steps of increasing complexity, each concluded with a validation check improving the interpretability and robustness of the process. The current version of Art comes equipped with nine predefined steps inspired by Andrej Karpathy's Recipe for Training Neural Networks, a visualization dashboard, and integration with loggers such as Neptune. The code related to this paper is available at: https://github.com/SebChw/Actually-Robust-Training.
☆ Enhancing Customer Churn Prediction in Telecommunications: An Adaptive Ensemble Learning Approach
Customer churn, the discontinuation of services by existing customers, poses a significant challenge to the telecommunications industry. This paper proposes a novel adaptive ensemble learning framework for highly accurate customer churn prediction. The framework integrates multiple base models, including XGBoost, LightGBM, LSTM, a Multi-Layer Perceptron (MLP) neural network, and Support Vector Machine (SVM). These models are strategically combined using a stacking ensemble method, further enhanced by meta-feature generation from base model predictions. A rigorous data preprocessing pipeline, coupled with a multi-faceted feature engineering approach, optimizes model performance. The framework is evaluated on three publicly available telecom churn datasets, demonstrating substantial accuracy improvements over state-of-the-art techniques. The research achieves a remarkable 99.28% accuracy, signifying a major advancement in churn prediction.The implications of this research for developing proactive customer retention strategies withinthe telecommunications industry are discussed.
comment: 12 pages,2 figures
☆ Web Service QoS Prediction via Extended Canonical Polyadic-based Tensor Network
Today, numerous web services with similar functionalities are available on the Internet. Users often evaluate the Quality of Service (QoS) to choose the best option among them. Predicting the QoS values of these web services is a significant challenge in the field of web services. A Canonical Polyadic (CP)-based tensor network model has proven to be efficient for predicting dynamic QoS data. However, current CP-based tensor network models do not consider the correlation of users and services in the low-dimensional latent feature space, thereby limiting model's prediction capability. To tackle this issue, this paper proposes an Extended Canonical polyadic-based Tensor Network (ECTN) model. It models the correlation of users and services via building a relation dimension between user feature and service feature in low-dimensional space, and then designs an extended CP decomposition structure to improve prediction accuracy. Experiments are conducted on two public dynamic QoS data, and the results show that compared with state-of-the-art QoS prediction models, the ECTN obtains higher prediction accuracy.
comment: 10 pages, 6 figures
☆ On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes
This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well-suited for large state space problems. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications compared to unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), introducing additional complexity in proving algorithmic convergence. We also characterize the sets to which RVI Q-learning algorithms converge, showing that they are compact, connected, potentially nonconvex, and comprised of solutions to the average-reward optimality equation, with exactly one less degree of freedom than the general solution set of this equation. Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
☆ Evaluating Time-Series Training Dataset through Lens of Spectrum in Deep State Space Models
This study investigates a method to evaluate time-series datasets in terms of the performance of deep neural networks (DNNs) with state space models (deep SSMs) trained on the dataset. SSMs have attracted attention as components inside DNNs to address time-series data. Since deep SSMs have powerful representation capacities, training datasets play a crucial role in solving a new task. However, the effectiveness of training datasets cannot be known until deep SSMs are actually trained on them. This can increase the cost of data collection for new tasks, as a trial-and-error process of data collection and time-consuming training are needed to achieve the necessary performance. To advance the practical use of deep SSMs, the metric of datasets to estimate the performance early in the training can be one key element. To this end, we introduce the concept of data evaluation methods used in system identification. In system identification of linear dynamical systems, the effectiveness of datasets is evaluated by using the spectrum of input signals. We introduce this concept to deep SSMs, which are nonlinear dynamical systems. We propose the K-spectral metric, which is the sum of the top-K spectra of signals inside deep SSMs, by focusing on the fact that each layer of a deep SSM can be regarded as a linear dynamical system. Our experiments show that the K-spectral metric has a large absolute value of the correlation coefficient with the performance and can be used to evaluate the quality of training datasets.
comment: 11 pages, 5 figures
☆ Coalitions of AI-based Methods Predict 15-Year Risks of Breast Cancer Metastasis Using Real-World Clinical Data with AUC up to 0.9
Breast cancer is one of the two cancers responsible for the most deaths in women, with about 42,000 deaths each year in the US. That there are over 300,000 breast cancers newly diagnosed each year suggests that only a fraction of the cancers result in mortality. Thus, most of the women undergo seemingly curative treatment for localized cancers, but a significant later succumb to metastatic disease for which current treatments are only temporizing for the vast majority. The current prognostic metrics are of little actionable value for 4 of the 5 women seemingly cured after local treatment, and many women are exposed to morbid and even mortal adjuvant therapies unnecessarily, with these adjuvant therapies reducing metastatic recurrence by only a third. Thus, there is a need for better prognostics to target aggressive treatment at those who are likely to relapse and spare those who were actually cured. While there is a plethora of molecular and tumor-marker assays in use and under-development to detect recurrence early, these are time consuming, expensive and still often un-validated as to actionable prognostic utility. A different approach would use large data techniques to determine clinical and histopathological parameters that would provide accurate prognostics using existing data. Herein, we report on machine learning, together with grid search and Bayesian Networks to develop algorithms that present a AUC of up to 0.9 in ROC analyses, using only extant data. Such algorithms could be rapidly translated to clinical management as they do not require testing beyond routine tumor evaluations.
☆ Iterated Energy-based Flow Matching for Sampling from Boltzmann Densities
In this work, we consider the problem of training a generator from evaluations of energy functions or unnormalized densities. This is a fundamental problem in probabilistic inference, which is crucial for scientific applications such as learning the 3D coordinate distribution of a molecule. To solve this problem, we propose iterated energy-based flow matching (iEFM), the first off-policy approach to train continuous normalizing flow (CNF) models from unnormalized densities. We introduce the simulation-free energy-based flow matching objective, which trains the model to predict the Monte Carlo estimation of the marginal vector field constructed from known energy functions. Our framework is general and can be extended to variance-exploding (VE) and optimal transport (OT) conditional probability paths. We evaluate iEFM on a two-dimensional Gaussian mixture model (GMM) and an eight-dimensional four-particle double-well potential (DW-4) energy function. Our results demonstrate that iEFM outperforms existing methods, showcasing its potential for efficient and scalable probabilistic modeling in complex high-dimensional systems.
☆ PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation
Approximate computing emerges as a promising approach to enhance the efficiency of compute-in-memory (CiM) systems in deep neural network processing. However, traditional approximate techniques often significantly trade off accuracy for power efficiency, and fail to reduce data transfer between main memory and CiM banks, which dominates power consumption. This paper introduces a novel probabilistic approximate computation (PAC) method that leverages statistical techniques to approximate multiply-and-accumulation (MAC) operations, reducing approximation error by 4X compared to existing approaches. PAC enables efficient sparsity-based computation in CiM systems by simplifying complex MAC vector computations into scalar calculations. Moreover, PAC enables sparsity encoding and eliminates the LSB activations transmission, significantly reducing data reads and writes. This sets PAC apart from traditional approximate computing techniques, minimizing not only computation power but also memory accesses by 50%, thereby boosting system-level efficiency. We developed PACiM, a sparsity-centric architecture that fully exploits sparsity to reduce bit-serial cycles by 81% and achieves a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS while maintaining high accuracy of 93.85/72.36/66.02% on CIFAR-10/CIFAR-100/ImageNet benchmarks using a ResNet-18 model, demonstrating the effectiveness of our PAC methodology.
☆ Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually nucleotides or peptides. These models have seen incredible success in downstream tasks in each domain and have achieved particularly noteworthy breakthroughs in sequences of peptides and structural modeling. However, these single-omic models are naturally incapable of modeling multi-omic tasks, one of the most biologically critical being nucleotide-peptide interactions. We present our work training the first multi-omic nucleotide-peptide foundation models. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology, despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on peptide-nucleotide interaction tasks, namely predicting the change in Gibbs free energy ({\Delta}G) of the binding interaction between a given oligonucleotide and peptide, as well as the effect on this binding interaction due to mutations in the oligonucleotide sequence ({\Delta}{\Delta}G). Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any prior structural training, allowing us to predict which peptide residues are most involved in the peptide-nucleotide binding interaction. Lastly, we provide evidence that multi-omic biosequence models are non-inferior to foundation models trained on single-omics distributions, suggesting a more generalized or foundational approach to building these models.
comment: 27 pages, 5 figures
☆ Enhancing Conditional Image Generation with Explainable Latent Space Manipulation
In the realm of image synthesis, achieving fidelity to a reference image while adhering to conditional prompts remains a significant challenge. This paper proposes a novel approach that integrates a diffusion model with latent space manipulation and gradient-based selective attention mechanisms to address this issue. Leveraging Grad-SAM (Gradient-based Selective Attention Manipulation), we analyze the cross attention maps of the cross attention layers and gradients for the denoised latent vector, deriving importance scores of elements of denoised latent vector related to the subject of interest. Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features. This approach ensures the faithful formation of subjects based on conditional prompts, while concurrently refining the background for a more coherent composition. Our experiments on places365 dataset demonstrate promising results, with our proposed model achieving the lowest mean and median Frechet Inception Distance (FID) scores compared to baseline models, indicating superior fidelity preservation. Furthermore, our model exhibits competitive performance in aligning the generated images with provided textual descriptions, as evidenced by high CLIP scores. These results highlight the effectiveness of our approach in both fidelity preservation and textual context preservation, offering a significant advancement in text-to-image synthesis tasks.
comment: 7 pages , 5 figures
☆ Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation
Learned language-conditioned robot policies often struggle to effectively adapt to new real-world tasks even when pre-trained across a diverse set of instructions. We propose a novel approach for few-shot adaptation to unseen tasks that exploits the semantic understanding of task decomposition provided by vision-language models (VLMs). Our method, Policy Adaptation via Language Optimization (PALO), combines a handful of demonstrations of a task with proposed language decompositions sampled from a VLM to quickly enable rapid nonparametric adaptation, avoiding the need for a larger fine-tuning dataset. We evaluate PALO on extensive real-world experiments consisting of challenging unseen, long-horizon robot manipulation tasks. We find that PALO is able of consistently complete long-horizon, multi-tier tasks in the real world, outperforming state of the art pre-trained generalist policies, and methods that have access to the same demonstrations.
comment: 27 pages, 14 figures
☆ Targeted Cause Discovery with Data-Driven Learning
We propose a novel machine learning approach for inferring causal variables of a target variable from observations. Our goal is to identify both direct and indirect causes within a system, thereby efficiently regulating the target variable when the difficulty and cost of intervening on each causal variable vary. Our method employs a neural network trained to identify causality through supervised learning on simulated data. By implementing a local-inference strategy, we achieve linear complexity with respect to the number of variables, efficiently scaling up to thousands of variables. Empirical results demonstrate the effectiveness of our method in identifying causal relationships within large-scale gene regulatory networks, outperforming existing causal discovery methods that primarily focus on direct causality. We validate our model's generalization capability across novel graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line. Implementation codes are available at https://github.com/snu-mllab/Targeted-Cause-Discovery.
comment: preprint
☆ Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks
Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems. It has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary with time, which fails to capture the non-stationary components in many real-world scenarios. Many existing algorithms also assume knowledge of network conditions before decision, which rules out applications where unpredictability presents. Motivated by these issues, we consider Adversarial Network Optimization (ANO) under bandit feedback. Specifically, we consider the task of *i)* maximizing some unknown and time-varying utility function associated to scheduler's actions, where *ii)* the underlying network is a non-stationary multi-hop one whose conditions change arbitrarily with time, and *iii)* only bandit feedback (effect of actually deployed actions) is revealed after decisions. Our proposed `UMO2` algorithm ensures network stability and also matches the utility maximization performance of any "mildly varying" reference policy up to a polynomially decaying gap. To our knowledge, no previous ANO algorithm handled multi-hop networks or achieved utility guarantees under bandit feedback, whereas ours can do both. Technically, our method builds upon a novel integration of online learning into Lyapunov analyses: To handle complex inter-dependencies among queues in multi-hop networks, we propose meticulous techniques to balance online learning and Lyapunov arguments. To tackle the learning obstacles due to potentially unbounded queue sizes, we design a new online linear optimization algorithm that automatically adapts to loss magnitudes. To maximize utility, we propose a bandit convex optimization algorithm with novel queue-dependent learning rate scheduling that suites drastically varying queue lengths. Our new insights in online learning can be of independent interest.
☆ The Application of Machine Learning in Tidal Evolution Simulation of Star-Planet Systems
With the release of a large amount of astronomical data, an increasing number of close-in hot Jupiters have been discovered. Calculating their evolutionary curves using star-planet interaction models presents a challenge. To expedite the generation of evolutionary curves for these close-in hot Jupiter systems, we utilized tidal interaction models established on MESA to create 15,745 samples of star-planet systems and 7,500 samples of stars. Additionally, we employed a neural network (Multi-Layer Perceptron - MLP) to predict the evolutionary curves of the systems, including stellar effective temperature, radius, stellar rotation period, and planetary orbital period. The median relative errors of the predicted evolutionary curves were found to be 0.15%, 0.43%, 2.61%, and 0.57%, respectively. Furthermore, the speed at which we generate evolutionary curves exceeds that of model-generated curves by more than four orders of magnitude. We also extracted features of planetary migration states and utilized lightGBM to classify the samples into 6 categories for prediction. We found that by combining three types that undergo long-term double synchronization into one label, the classifier effectively recognized these features. Apart from systems experiencing long-term double synchronization, the median relative errors of the predicted evolutionary curves were all below 4%. Our work provides an efficient method to save significant computational resources and time with minimal loss in accuracy. This research also lays the foundation for analyzing the evolutionary characteristics of systems under different migration states, aiding in the understanding of the underlying physical mechanisms of such systems. Finally, to a large extent, our approach could replace the calculations of theoretical models.
☆ ReXamine-Global: A Framework for Uncovering Inconsistencies in Radiology Report Generation Metrics
Given the rapidly expanding capabilities of generative AI models for radiology, there is a need for robust metrics that can accurately measure the quality of AI-generated radiology reports across diverse hospitals. We develop ReXamine-Global, a LLM-powered, multi-site framework that tests metrics across different writing styles and patient populations, exposing gaps in their generalization. First, our method tests whether a metric is undesirably sensitive to reporting style, providing different scores depending on whether AI-generated reports are stylistically similar to ground-truth reports or not. Second, our method measures whether a metric reliably agrees with experts, or whether metric and expert scores of AI-generated report quality diverge for some sites. Using 240 reports from 6 hospitals around the world, we apply ReXamine-Global to 7 established report evaluation metrics and uncover serious gaps in their generalizability. Developers can apply ReXamine-Global when designing new report evaluation metrics, ensuring their robustness across sites. Additionally, our analysis of existing metrics can guide users of those metrics towards evaluation procedures that work reliably at their sites of interest.
☆ Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation
Micro-batch clipping, a gradient clipping method, has recently shown potential in enhancing auto-speech recognition (ASR) model performance. However, the underlying mechanism behind this improvement remains mysterious, particularly the observation that only certain micro-batch sizes are beneficial. In this paper, we make the first attempt to explain this phenomenon. Inspired by recent data pruning research, we assume that specific training samples may impede model convergence during certain training phases. Under this assumption, the convergence analysis shows that micro-batch clipping can improve the convergence rate asymptotically at the cost of an additional constant bias that does not diminish with more training iterations. The bias is dependent on a few factors and can be minimized at specific micro-batch size, thereby elucidating the existence of the sweet-spot micro-batch size observed previously. We also verify the effectiveness of micro-batch clipping beyond speech models on vision and language models, and show promising performance gains in these domains. An exploration of potential limitations shows that micro-batch clipping is less effective when training data originates from multiple distinct domains.
☆ Short-Term Electricity-Load Forecasting by Deep Learning: A Comprehensive Survey
Short-Term Electricity-Load Forecasting (STELF) refers to the prediction of the immediate demand (in the next few hours to several days) for the power system. Various external factors, such as weather changes and the emergence of new electricity consumption scenarios, can impact electricity demand, causing load data to fluctuate and become non-linear, which increases the complexity and difficulty of STELF. In the past decade, deep learning has been applied to STELF, modeling and predicting electricity demand with high accuracy, and contributing significantly to the development of STELF. This paper provides a comprehensive survey on deep-learning-based STELF over the past ten years. It examines the entire forecasting process, including data pre-processing, feature extraction, deep-learning modeling and optimization, and results evaluation. This paper also identifies some research challenges and potential research directions to be further investigated in future work.
☆ Uni-3DAD: GAN-Inversion Aided Universal 3D Anomaly Detection on Model-free Products
Anomaly detection is a long-standing challenge in manufacturing systems. Traditionally, anomaly detection has relied on human inspectors. However, 3D point clouds have gained attention due to their robustness to environmental factors and their ability to represent geometric data. Existing 3D anomaly detection methods generally fall into two categories. One compares scanned 3D point clouds with design files, assuming these files are always available. However, such assumptions are often violated in many real-world applications where model-free products exist, such as fresh produce (i.e., ``Cookie", ``Potato", etc.), dentures, bone, etc. The other category compares patches of scanned 3D point clouds with a library of normal patches named memory bank. However, those methods usually fail to detect incomplete shapes, which is a fairly common defect type (i.e., missing pieces of different products). The main challenge is that missing areas in 3D point clouds represent the absence of scanned points. This makes it infeasible to compare the missing region with existing point cloud patches in the memory bank. To address these two challenges, we proposed a unified, unsupervised 3D anomaly detection framework capable of identifying all types of defects on model-free products. Our method integrates two detection modules: a feature-based detection module and a reconstruction-based detection module. Feature-based detection covers geometric defects, such as dents, holes, and cracks, while the reconstruction-based method detects missing regions. Additionally, we employ a One-class Support Vector Machine (OCSVM) to fuse the detection results from both modules. The results demonstrate that (1) our proposed method outperforms the state-of-the-art methods in identifying incomplete shapes and (2) it still maintains comparable performance with the SOTA methods in detecting all other types of anomalies.
☆ Variational Mode-Driven Graph Convolutional Network for Spatiotemporal Traffic Forecasting
This paper focuses on spatio-temporal (ST) traffic prediction traffic using graph neural networks. Given that ST data consists of non-stationary and complex time events, interpreting and predicting such trends is comparatively complicated. Representation of ST data in modes helps us infer behavior and assess the impact of noise on prediction applications. We propose a framework that decomposes ST data into modes using the variational mode decomposition (VMD) method, which is then fed into the neural network for forecasting future states. This hybrid approach is known as a variational mode graph convolutional network (VMGCN). Instead of exhaustively searching for the number of modes, they are determined using the reconstruction loss from the real-time application data. We also study the significance of each mode and the impact of bandwidth constraints on different horizon predictions in traffic flow data. We evaluate the performance of our proposed network on the LargeST dataset for both short and long-term predictions. Our framework yields better results compared to state-of-the-art methods.
comment: IEEE Transactions on Intelligent Transportation Systems Submission, 2024
☆ A More Unified Theory of Transfer Learning
We show that some basic moduli of continuity $\delta$ -- which measure how fast target risk decreases as source risk decreases -- appear to be at the root of many of the classical relatedness measures in transfer learning and related literature. Namely, bounds in terms of $\delta$ recover many of the existing bounds in terms of other measures of relatedness -- both in regression and classification -- and can at times be tighter. We are particularly interested in general situations where the learner has access to both source data and some or no target data. The unified perspective allowed by the moduli $\delta$ allow us to extend many existing notions of relatedness at once to these scenarios involving target data: interestingly, while $\delta$ itself might not be efficiently estimated, adaptive procedures exist -- based on reductions to confidence sets -- which can get nearly tight rates in terms of $\delta$ with no prior distributional knowledge. Such adaptivity to unknown $\delta$ immediately implies adaptivity to many classical relatedness notions, in terms of combined source and target samples' sizes.
☆ Real-Time Energy Pricing in New Zealand: An Evolving Stream Analysis PRICAI
This paper introduces a group of novel datasets representing real-time time-series and streaming data of energy prices in New Zealand, sourced from the Electricity Market Information (EMI) website maintained by the New Zealand government. The datasets are intended to address the scarcity of proper datasets for streaming regression learning tasks. We conduct extensive analyses and experiments on these datasets, covering preprocessing techniques, regression tasks, prediction intervals, concept drift detection, and anomaly detection. Our experiments demonstrate the datasets' utility and highlight the challenges and opportunities for future research in energy price forecasting.
comment: 12 Pages, 8 figures, short version accepted by PRICAI
☆ Single-Loop Deterministic and Stochastic Interior-Point Algorithms for Nonlinearly Constrained Optimization
An interior-point algorithm framework is proposed, analyzed, and tested for solving nonlinearly constrained continuous optimization problems. The main setting of interest is when the objective and constraint functions may be nonlinear and/or nonconvex, and when constraint values and derivatives are tractable to compute, but objective function values and derivatives can only be estimated. The algorithm is intended primarily for a setting that is similar for stochastic-gradient methods for unconstrained optimization, namely, the setting when stochastic-gradient estimates are available and employed in place of gradients of the objective, and when no objective function values (nor estimates of them) are employed. This is achieved by the interior-point framework having a single-loop structure rather than the nested-loop structure that is typical of contemporary interior-point methods. For completeness, convergence guarantees for the framework are provided both for deterministic and stochastic settings. Numerical experiments show that the algorithm yields good performance on a large set of test problems.
☆ A Minibatch-SGD-Based Learning Meta-Policy for Inventory Systems with Myopic Optimal Policy
Stochastic gradient descent (SGD) has proven effective in solving many inventory control problems with demand learning. However, it often faces the pitfall of an infeasible target inventory level that is lower than the current inventory level. Several recent works (e.g., Huh and Rusmevichientong (2009), Shi et al.(2016)) are successful to resolve this issue in various inventory systems. However, their techniques are rather sophisticated and difficult to be applied to more complicated scenarios such as multi-product and multi-constraint inventory systems. In this paper, we address the infeasible-target-inventory-level issue from a new technical perspective -- we propose a novel minibatch-SGD-based meta-policy. Our meta-policy is flexible enough to be applied to a general inventory systems framework covering a wide range of inventory management problems with myopic clairvoyant optimal policy. By devising the optimal minibatch scheme, our meta-policy achieves a regret bound of $\mathcal{O}(\sqrt{T})$ for the general convex case and $\mathcal{O}(\log T)$ for the strongly convex case. To demonstrate the power and flexibility of our meta-policy, we apply it to three important inventory control problems: multi-product and multi-constraint systems, multi-echelon serial systems, and one-warehouse and multi-store systems by carefully designing application-specific subroutines.We also conduct extensive numerical experiments to demonstrate that our meta-policy enjoys competitive regret performance, high computational efficiency, and low variances among a wide range of applications.
comment: Forthcoming in Management Science
☆ Different Victims, Same Layout: Email Visual Similarity Detection for Enhanced Email Protection CCS 2024
In the pursuit of an effective spam detection system, the focus has often been on identifying known spam patterns either through rule-based detection systems or machine learning (ML) solutions. However, both systems are susceptible to evasion techniques and zero-day attacks that can be achieved at low cost. Therefore, an email that bypassed the defense system once can do it again in the following days, even though rules are updated or the ML models are retrained. The recurrence of failures to detect emails that exhibit layout similarities to previously undetected spam is concerning for customers and can erode their trust in a company. Our observations show that threat actors reuse email kits extensively and can bypass detection with little effort, for example, by making changes to the content of emails. In this work, we propose an email visual similarity detection approach, named Pisco, to improve the detection capabilities of an email threat defense system. We apply our proof of concept to some real-world samples received from different sources. Our results show that email kits are being reused extensively and visually similar emails are sent to our customers at various time intervals. Therefore, this method could be very helpful in situations where detection features that rely on contextual information and keywords are bypassed, an occurrence our observations show happens frequently.
comment: To be published in the proceedings of the ACM Conference on Computer and Communications Security (ACM CCS 2024)
☆ FlowRetrieval: Flow-Guided Data Retrieval for Few-Shot Imitation Learning
Few-shot imitation learning relies on only a small amount of task-specific demonstrations to efficiently adapt a policy for a given downstream tasks. Retrieval-based methods come with a promise of retrieving relevant past experiences to augment this target data when learning policies. However, existing data retrieval methods fall under two extremes: they either rely on the existence of exact behaviors with visually similar scenes in the prior data, which is impractical to assume; or they retrieve based on semantic similarity of high-level language descriptions of the task, which might not be that informative about the shared low-level behaviors or motions across tasks that is often a more important factor for retrieving relevant data for policy learning. In this work, we investigate how we can leverage motion similarity in the vast amount of cross-task data to improve few-shot imitation learning of the target task. Our key insight is that motion-similar data carries rich information about the effects of actions and object interactions that can be leveraged during few-shot adaptation. We propose FlowRetrieval, an approach that leverages optical flow representations for both extracting similar motions to target tasks from prior data, and for guiding learning of a policy that can maximally benefit from such data. Our results show FlowRetrieval significantly outperforms prior methods across simulated and real-world domains, achieving on average 27% higher success rate than the best retrieval-based prior method. In the Pen-in-Cup task with a real Franka Emika robot, FlowRetrieval achieves 3.7x the performance of the baseline imitation learning technique that learns from all prior and target data. Website: https://flow-retrieval.github.io
☆ Efficient Transonic Aeroelastic Model Reduction Using Optimized Sparse Multi-Input Polynomial Functionals
Nonlinear aeroelastic reduced-order models (ROMs) based on machine learning or artificial intelligence algorithms can be complex and computationally demanding to train, meaning that for practical aeroelastic applications, the conservative nature of linearization is often favored. Therefore, there is a requirement for novel nonlinear aeroelastic model reduction approaches that are accurate, simple and, most importantly, efficient to generate. This paper proposes a novel formulation for the identification of a compact multi-input Volterra series, where Orthogonal Matching Pursuit is used to obtain a set of optimally sparse nonlinear multi-input ROM coefficients from unsteady aerodynamic training data. The framework is exemplified using the Benchmark Supercritical Wing, considering; forced response, flutter and limit cycle oscillation. The simple and efficient Optimal Sparsity Multi-Input ROM (OSM-ROM) framework performs with high accuracy compared to the full-order aeroelastic model, requiring only a fraction of the tens-of-thousands of possible multi-input terms to be identified and allowing a 96% reduction in the number of training samples.
comment: 24 pages, preprint, under review
☆ Theoretical Insights into Overparameterized Models in Multi-Task and Replay-Based Continual Learning
Multi-task learning (MTL) is a machine learning paradigm that aims to improve the generalization performance of a model on multiple related tasks by training it simultaneously on those tasks. Unlike MTL, where the model has instant access to the training data of all tasks, continual learning (CL) involves adapting to new sequentially arriving tasks over time without forgetting the previously acquired knowledge. Despite the wide practical adoption of CL and MTL and extensive literature on both areas, there remains a gap in the theoretical understanding of these methods when used with overparameterized models such as deep neural networks. This paper studies the overparameterized linear models as a proxy for more complex models. We develop theoretical results describing the effect of various system parameters on the model's performance in an MTL setup. Specifically, we study the impact of model size, dataset size, and task similarity on the generalization error and knowledge transfer. Additionally, we present theoretical results to characterize the performance of replay-based CL models. Our results reveal the impact of buffer size and model capacity on the forgetting rate in a CL setup and help shed light on some of the state-of-the-art CL methods. Finally, through extensive empirical evaluations, we demonstrate that our theoretical findings are also applicable to deep neural networks, offering valuable guidance for designing MTL and CL models in practice.
comment: 41 pages, 21 figures
☆ AI-driven Reverse Engineering of QML Models
Quantum machine learning (QML) is a rapidly emerging area of research, driven by the capabilities of Noisy Intermediate-Scale Quantum (NISQ) devices. With the progress in the research of QML models, there is a rise in third-party quantum cloud services to cater to the increasing demand for resources. New security concerns surface, specifically regarding the protection of intellectual property (IP) from untrustworthy service providers. One of the most pressing risks is the potential for reverse engineering (RE) by malicious actors who may steal proprietary quantum IPs such as trained parameters and QML architecture, modify them to remove additional watermarks or signatures and re-transpile them for other quantum hardware. Prior work presents a brute force approach to RE the QML parameters which takes exponential time overhead. In this paper, we introduce an autoencoder-based approach to extract the parameters from transpiled QML models deployed on untrusted third-party vendors. We experiment on multi-qubit classifiers and note that they can be reverse-engineered under restricted conditions with a mean error of order 10^-1. The amount of time taken to prepare the dataset and train the model to reverse engineer the QML circuit being of the order 10^3 seconds (which is 10^2x better than the previously reported value for 4-layered 4-qubit classifiers) makes the threat of RE highly potent, underscoring the need for continued development of effective defenses.
comment: 7 pages, 4 figures
☆ Analyzing Inference Privacy Risks Through Gradients in Machine Learning
In distributed learning settings, models are iteratively updated with shared gradients computed from potentially sensitive user data. While previous work has studied various privacy risks of sharing gradients, our paper aims to provide a systematic approach to analyze private information leakage from gradients. We present a unified game-based framework that encompasses a broad range of attacks including attribute, property, distributional, and user disclosures. We investigate how different uncertainties of the adversary affect their inferential power via extensive experiments on five datasets across various data modalities. Our results demonstrate the inefficacy of solely relying on data aggregation to achieve privacy against inference attacks in distributed learning. We further evaluate five types of defenses, namely, gradient pruning, signed gradient descent, adversarial perturbations, variational information bottleneck, and differential privacy, under both static and adaptive adversary settings. We provide an information-theoretic view for analyzing the effectiveness of these defenses against inference from gradients. Finally, we introduce a method for auditing attribute inference privacy, improving the empirical estimation of worst-case privacy through crafting adversarial canary records.
☆ DLFormer: Enhancing Explainability in Multivariate Time Series Forecasting using Distributed Lag Embedding
. Most real-world variables are multivariate time series influenced by past values and explanatory factors. Consequently, predicting these time series data using artificial intelligence is ongoing. In particular, in fields such as healthcare and finance, where reliability is crucial, having understandable explanations for predictions is essential. However, achieving a balance between high prediction accuracy and intuitive explainability has proven challenging. Although attention-based models have limitations in representing the individual influences of each variable, these models can influence the temporal dependencies in time series prediction and the magnitude of the influence of individual variables. To address this issue, this study introduced DLFormer, an attention-based architecture integrated with distributed lag embedding, to temporally embed individual variables and capture their temporal influence. Through validation against various real-world datasets, DLFormer showcased superior performance improvements compared to existing attention-based high-performance models. Furthermore, comparing the relationships between variables enhanced the reliability of explainability.
☆ Exploring Multiple Strategies to Improve Multilingual Coreference Resolution in CorefUD
Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component in various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, utilizing the CorefUD 1.1 dataset, which spans 17 datasets across 12 languages. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance performance across diverse linguistic contexts. These extensions include cross-lingual training, incorporation of syntactic information, a Span2Head model for optimized headword prediction, and advanced singleton modeling. We also experiment with headword span representation and long-documents modeling through overlapping segments. The proposed extensions, particularly the heads-only approach, singleton modeling, and long document prediction significantly improve performance across most datasets. We also perform zero-shot cross-lingual experiments, highlighting the potential and limitations of cross-lingual transfer in coreference resolution. Our findings contribute to the development of robust and scalable coreference systems for multilingual coreference resolution. Finally, we evaluate our model on CorefUD 1.1 test set and surpass the best model from CRAC 2023 shared task of a comparable size by a large margin. Our nodel is available on GitHub: \url{https://github.com/ondfa/coref-multiling}
☆ Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector
Deepfakes, which employ GAN to produce highly realistic facial modification, are widely regarded as the prevailing method. Traditional CNN have been able to identify bogus media, but they struggle to perform well on different datasets and are vulnerable to adversarial attacks due to their lack of robustness. Vision transformers have demonstrated potential in the realm of image classification problems, but they require enough training data. Motivated by these limitations, this publication introduces Tex-ViT (Texture-Vision Transformer), which enhances CNN features by combining ResNet with a vision transformer. The model combines traditional ResNet features with a texture module that operates in parallel on sections of ResNet before each down-sampling operation. The texture module then serves as an input to the dual branch of the cross-attention vision transformer. It specifically focuses on improving the global texture module, which extracts feature map correlation. Empirical analysis reveals that fake images exhibit smooth textures that do not remain consistent over long distances in manipulations. Experiments were performed on different categories of FF++, such as DF, f2f, FS, and NT, together with other types of GAN datasets in cross-domain scenarios. Furthermore, experiments also conducted on FF++, DFDCPreview, and Celeb-DF dataset underwent several post-processing situations, such as blurring, compression, and noise. The model surpassed the most advanced models in terms of generalization, achieving a 98% accuracy in cross-domain scenarios. This demonstrates its ability to learn the shared distinguishing textural characteristics in the manipulated samples. These experiments provide evidence that the proposed model is capable of being applied to various situations and is resistant to many post-processing procedures.
☆ Robotic warehousing operations: a learn-then-optimize approach to large-scale neighborhood search
The rapid deployment of robotics technologies requires dedicated optimization algorithms to manage large fleets of autonomous agents. This paper supports robotic parts-to-picker operations in warehousing by optimizing order-workstation assignments, item-pod assignments and the schedule of order fulfillment at workstations. The model maximizes throughput, while managing human workload at the workstations and congestion in the facility. We solve it via large-scale neighborhood search, with a novel learn-then-optimize approach to subproblem generation. The algorithm relies on an offline machine learning procedure to predict objective improvements based on subproblem features, and an online optimization model to generate a new subproblem at each iteration. In collaboration with Amazon Robotics, we show that our model and algorithm generate much stronger solutions for practical problems than state-of-the-art approaches. In particular, our solution enhances the utilization of robotic fleets by coordinating robotic tasks for human operators to pick multiple items at once, and by coordinating robotic routes to avoid congestion in the facility.
☆ LLaVA-Chef: A Multi-modal Generative Model for Food Recipes
In the rapidly evolving landscape of online recipe sharing within a globalized context, there has been a notable surge in research towards comprehending and generating food recipes. Recent advancements in large language models (LLMs) like GPT-2 and LLaVA have paved the way for Natural Language Processing (NLP) approaches to delve deeper into various facets of food-related tasks, encompassing ingredient recognition and comprehensive recipe generation. Despite impressive performance and multi-modal adaptability of LLMs, domain-specific training remains paramount for their effective application. This work evaluates existing LLMs for recipe generation and proposes LLaVA-Chef, a novel model trained on a curated dataset of diverse recipe prompts in a multi-stage approach. First, we refine the mapping of visual food image embeddings to the language space. Second, we adapt LLaVA to the food domain by fine-tuning it on relevant recipe data. Third, we utilize diverse prompts to enhance the model's recipe comprehension. Finally, we improve the linguistic quality of generated recipes by penalizing the model with a custom loss function. LLaVA-Chef demonstrates impressive improvements over pretrained LLMs and prior works. A detailed qualitative analysis reveals that LLaVA-Chef generates more detailed recipes with precise ingredient mentions, compared to existing approaches.
☆ Revising Multimodal VAEs with Diffusion Decoders
Multimodal VAEs often struggle with generating high-quality outputs, a challenge that extends beyond the inherent limitations of the VAE framework. The core issue lies in the restricted joint representation of the latent space, particularly when complex modalities like images are involved. Feedforward decoders, commonly used for these intricate modalities, inadvertently constrain the joint latent space, leading to a degradation in the quality of the other modalities as well. Although recent studies have shown improvement by introducing modality-specific representations, the issue remains significant. In this work, we demonstrate that incorporating a flexible diffusion decoder specifically for the image modality not only enhances the generation quality of the images but also positively impacts the performance of the other modalities that rely on feedforward decoders. This approach addresses the limitations imposed by conventional joint representations and opens up new possibilities for improving multimodal generation tasks using the multimodal VAE framework. Our model provides state-of-the-art results compared to other multimodal VAEs in different datasets with higher coherence and superior quality in the generated modalities
☆ Coverage Analysis of Multi-Environment Q-Learning Algorithms for Wireless Network Optimization
Q-learning is widely used to optimize wireless networks with unknown system dynamics. Recent advancements include ensemble multi-environment hybrid Q-learning algorithms, which utilize multiple Q-learning algorithms across structurally related but distinct Markovian environments and outperform existing Q-learning algorithms in terms of accuracy and complexity in large-scale wireless networks. We herein conduct a comprehensive coverage analysis to ensure optimal data coverage conditions for these algorithms. Initially, we establish upper bounds on the expectation and variance of different coverage coefficients. Leveraging these bounds, we present an algorithm for efficient initialization of these algorithms. We test our algorithm on two distinct real-world wireless networks. Numerical simulations show that our algorithm can achieve %50 less policy error and %40 less runtime complexity than state-of-the-art reinforcement learning algorithms. Furthermore, our algorithm exhibits robustness to changes in network settings and parameters. We also numerically validate our theoretical results.
☆ Longitudinal Modularity, a Modularity for Link Streams
Temporal networks are commonly used to model real-life phenomena. When these phenomena represent interactions and are captured at a fine-grained temporal resolution, they are modeled as link streams. Community detection is an essential network analysis task. Although many methods exist for static networks, and some methods have been developed for temporal networks represented as sequences of snapshots, few works can handle link streams. This article introduces the first adaptation of the well-known Modularity quality function to link streams. Unlike existing methods, it is independent of the time scale of analysis. After introducing the quality function, and its relation to existing static and dynamic definitions of Modularity, we show experimentally its relevance for dynamic community evaluation.
☆ Learning Multi-agent Multi-machine Tending by Mobile Robots
Robotics can help address the growing worker shortage challenge of the manufacturing industry. As such, machine tending is a task collaborative robots can tackle that can also highly boost productivity. Nevertheless, existing robotics systems deployed in that sector rely on a fixed single-arm setup, whereas mobile robots can provide more flexibility and scalability. In this work, we introduce a multi-agent multi-machine tending learning framework by mobile robots based on Multi-agent Reinforcement Learning (MARL) techniques with the design of a suitable observation and reward. Moreover, an attention-based encoding mechanism is developed and integrated into Multi-agent Proximal Policy Optimization (MAPPO) algorithm to boost its performance for machine tending scenarios. Our model (AB-MAPPO) outperformed MAPPO in this new challenging scenario in terms of task success, safety, and resources utilization. Furthermore, we provided an extensive ablation study to support our various design decisions.
comment: 7 pages, 4 figures
☆ GSTAM: Efficient Graph Distillation with Structural Attention-Matching ECCV
Graph distillation has emerged as a solution for reducing large graph datasets to smaller, more manageable, and informative ones. Existing methods primarily target node classification, involve computationally intensive processes, and fail to capture the true distribution of the full graph dataset. To address these issues, we introduce Graph Distillation with Structural Attention Matching (GSTAM), a novel method for condensing graph classification datasets. GSTAM leverages the attention maps of GNNs to distill structural information from the original dataset into synthetic graphs. The structural attention-matching mechanism exploits the areas of the input graph that GNNs prioritize for classification, effectively distilling such information into the synthetic graphs and improving overall distillation performance. Comprehensive experiments demonstrate GSTAM's superiority over existing methods, achieving 0.45% to 6.5% better performance in extreme condensation ratios, highlighting its potential use in advancing distillation for graph classification tasks (Code available at https://github.com/arashrasti96/GSTAM).
comment: Accepted at ECCV-DD 2024
☆ Characterization of point-source transient events with a rolling-shutter compressed sensing system
Point-source transient events (PSTEs) - optical events that are both extremely fast and extremely small - pose several challenges to an imaging system. Due to their speed, accurately characterizing such events often requires detectors with very high frame rates. Due to their size, accurately detecting such events requires maintaining coverage over an extended field-of-view, often through the use of imaging focal plane arrays (FPA) with a global shutter readout. Traditional imaging systems that meet these requirements are costly in terms of price, size, weight, power consumption, and data bandwidth, and there is a need for cheaper solutions with adequate temporal and spatial coverage. To address these issues, we develop a novel compressed sensing algorithm adapted to the rolling shutter readout of an imaging system. This approach enables reconstruction of a PSTE signature at the sampling rate of the rolling shutter, offering a 1-2 order of magnitude temporal speedup and a proportional reduction in data bandwidth. We present empirical results demonstrating accurate recovery of PSTEs using measurements that are spatially undersampled by a factor of 25, and our simulations show that, relative to other compressed sensing algorithms, our algorithm is both faster and yields higher quality reconstructions. We also present theoretical results characterizing our algorithm and corroborating simulations. The potential impact of our work includes the development of much faster, cheaper sensor solutions for PSTE detection and characterization.
comment: 20 pages, 11 figures
☆ Probabilistic Decomposed Linear Dynamical Systems for Robust Discovery of Latent Neural Dynamics
Time-varying linear state-space models are powerful tools for obtaining mathematically interpretable representations of neural signals. For example, switching and decomposed models describe complex systems using latent variables that evolve according to simple locally linear dynamics. However, existing methods for latent variable estimation are not robust to dynamical noise and system nonlinearity due to noise-sensitive inference procedures and limited model formulations. This can lead to inconsistent results on signals with similar dynamics, limiting the model's ability to provide scientific insight. In this work, we address these limitations and propose a probabilistic approach to latent variable estimation in decomposed models that improves robustness against dynamical noise. Additionally, we introduce an extended latent dynamics model to improve robustness against system nonlinearities. We evaluate our approach on several synthetic dynamical systems, including an empirically-derived brain-computer interface experiment, and demonstrate more accurate latent variable inference in nonlinear systems with diverse noise conditions. Furthermore, we apply our method to a real-world clinical neurophysiology dataset, illustrating the ability to identify interpretable and coherent structure where previous models cannot.
☆ The Star Geometry of Critic-Based Regularizer Learning
Variational regularization is a classical technique to solve statistical inference tasks and inverse problems, with modern data-driven approaches parameterizing regularizers via deep neural networks showcasing impressive empirical performance. Recent works along these lines learn task-dependent regularizers. This is done by integrating information about the measurements and ground-truth data in an unsupervised, critic-based loss function, where the regularizer attributes low values to likely data and high values to unlikely data. However, there is little theory about the structure of regularizers learned via this process and how it relates to the two data distributions. To make progress on this challenge, we initiate a study of optimizing critic-based loss functions to learn regularizers over a particular family of regularizers: gauges (or Minkowski functionals) of star-shaped bodies. This family contains regularizers that are commonly employed in practice and shares properties with regularizers parameterized by deep neural networks. We specifically investigate critic-based losses derived from variational representations of statistical distances between probability measures. By leveraging tools from star geometry and dual Brunn-Minkowski theory, we illustrate how these losses can be interpreted as dual mixed volumes that depend on the data distribution. This allows us to derive exact expressions for the optimal regularizer in certain cases. Finally, we identify which neural network architectures give rise to such star body gauges and when do such regularizers have favorable properties for optimization. More broadly, this work highlights how the tools of star geometry can aid in understanding the geometry of unsupervised regularizer learning.
☆ Machine Learning-Based Research on the Adaptability of Adolescents to Online Education
With the rapid advancement of internet technology, the adaptability of adolescents to online learning has emerged as a focal point of interest within the educational sphere. However, the academic community's efforts to develop predictive models for adolescent online learning adaptability require further refinement and expansion. Utilizing data from the "Chinese Adolescent Online Education Survey" spanning the years 2014 to 2016, this study implements five machine learning algorithms - logistic regression, K-nearest neighbors, random forest, XGBoost, and CatBoost - to analyze the factors influencing adolescent online learning adaptability and to determine the model best suited for prediction. The research reveals that the duration of courses, the financial status of the family, and age are the primary factors affecting students' adaptability in online learning environments. Additionally, age significantly impacts students' adaptive capacities. Among the predictive models, the random forest, XGBoost, and CatBoost algorithms demonstrate superior forecasting capabilities, with the random forest model being particularly adept at capturing the characteristics of students' adaptability.
♻ ☆ Batched Stochastic Bandit for Nondegenerate Functions
This paper studies batched bandit learning problems for nondegenerate functions. We introduce an algorithm that solves the batched bandit problem for nondegenerate functions near-optimally. More specifically, we introduce an algorithm, called Geometric Narrowing (GN), whose regret bound is of order $\widetilde{{\mathcal{O}}} ( A_{+}^d \sqrt{T} )$. In addition, GN only needs $\mathcal{O} (\log \log T)$ batches to achieve this regret. We also provide lower bound analysis for this problem. More specifically, we prove that over some (compact) doubling metric space of doubling dimension $d$: 1. For any policy $\pi$, there exists a problem instance on which $\pi$ admits a regret of order ${\Omega} ( A_-^d \sqrt{T})$; 2. No policy can achieve a regret of order $ A_-^d \sqrt{T} $ over all problem instances, using less than $ \Omega ( \log \log T ) $ rounds of communications. Our lower bound analysis shows that the GN algorithm achieves near optimal regret with minimal number of batches.
comment: 34 pages, 14 colored figures
♻ ☆ VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful for content like cartoons, sketches and scientific figures. Recent studies have shown promising results on processing vector graphics with capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a) both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of prompting techniques, (e) under multiple LLMs and (f) comparison with VLMs on rasterized representations. Evaluating on our collected 4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced at https://vgbench.github.io.
comment: Project Page: https://vgbench.github.io
♻ ☆ Conditional score-based diffusion models for solving inverse problems in mechanics
We propose a framework to perform Bayesian inference using conditional score-based diffusion models to solve a class of inverse problems in mechanics involving the inference of a specimen's spatially varying material properties from noisy measurements of its mechanical response to loading. Conditional score-based diffusion models are generative models that learn to approximate the score function of a conditional distribution using samples from the joint distribution. More specifically, the score functions corresponding to multiple realizations of the measurement are approximated using a single neural network, the so-called score network, which is subsequently used to sample the posterior distribution using an appropriate Markov chain Monte Carlo scheme based on Langevin dynamics. Training the score network only requires simulating the forward model. Hence, the proposed approach can accommodate black-box forward models and complex measurement noise. Moreover, once the score network has been trained, it can be re-used to solve the inverse problem for different realizations of the measurements. We demonstrate the efficacy of the proposed approach on a suite of high-dimensional inverse problems in mechanics that involve inferring heterogeneous material properties from noisy measurements. Some examples we consider involve synthetic data, while others include data collected from actual elastography experiments. Further, our applications demonstrate that the proposed approach can handle different measurement modalities, complex patterns in the inferred quantities, non-Gaussian and non-additive noise models, and nonlinear black-box forward models. The results show that the proposed framework can solve large-scale physics-based inverse problems efficiently.
♻ ☆ FilFL: Client Filtering for Optimized Client Participation in Federated Learning ECAI'24
Federated learning, an emerging machine learning paradigm, enables clients to collaboratively train a model without exchanging local data. Clients participating in the training process significantly impact the convergence rate, learning efficiency, and model generalization. We propose a novel approach, client filtering, to improve model generalization and optimize client participation and training. The proposed method periodically filters available clients to identify a subset that maximizes a combinatorial objective function with an efficient greedy filtering algorithm. Thus, the clients are assessed as a combination rather than individually. We theoretically analyze the convergence of federated learning with client filtering in heterogeneous settings and evaluate its performance across diverse vision and language tasks, including realistic scenarios with time-varying client availability. Our empirical results demonstrate several benefits of our approach, including improved learning efficiency, faster convergence, and up to 10% higher test accuracy than training without client filtering.
comment: Accepted at ECAI'24
♻ ☆ Learning to Prompt Your Domain for Vision-Language Models
Prompt learning has recently become a very efficient transfer learning paradigm for Contrastive Language Image Pretraining (CLIP) models. Compared with fine-tuning the entire encoder, prompt learning can obtain highly competitive results by optimizing only a small number of parameters, which presents considerably exciting benefits for federated learning applications that prioritizes communication efficiency. However, in this work, we identify that directly transferring prompt learning approaches into federated learning does not yield favorable results since the model often suffers from considerable domain gaps across different clients. To address this issue, we propose ADAPT, a novel domain-aware prompt learning approach that facilitates both intra- and inter-domain prompts across federated participants. The basic idea of ADAPT is that the prompted CLIP should detect the input image's domain correspondence and before making the prediction of its category. Extensive experiments of ADAPT demonstrate its significant efficiency and effectiveness in federated learning. For example, by learning and sharing only 0.08M parameters, our ADAPT attains a 68.4% average accuracy over six domains in the DomainNet dataset, which improves the original CLIP by a large margin of 14.8%.
♻ ☆ Evaluation Framework for Feedback Generation Methods in Skeletal Movement Assessment ECCV 2024
The application of machine-learning solutions to movement assessment from skeleton videos has attracted significant research attention in recent years. This advancement has made rehabilitation at home more accessible, utilizing movement assessment algorithms that can operate on affordable equipment for human pose detection and analysis from 2D or 3D videos. While the primary objective of automatic assessment tasks is to score movements, the automatic generation of feedback highlighting key movement issues has the potential to significantly enhance and accelerate the rehabilitation process. While numerous research works exist in the field of automatic movement assessment, only a handful address feedback generation. In this study, we propose terminology and criteria for the classification, evaluation, and comparison of feedback generation solutions. We discuss the challenges associated with each feedback generation approach and use our proposed criteria to classify existing solutions. To our knowledge, this is the first work that formulates feedback generation in skeletal movement assessment.
comment: Accepted to xAI4Biometrics 2024 at ECCV 2024
♻ ☆ Adaptive Log-Euclidean Metrics for SPD Matrix Learning
Symmetric Positive Definite (SPD) matrices have received wide attention in machine learning due to their intrinsic capacity to encode underlying structural correlation in data. Many successful Riemannian metrics have been proposed to reflect the non-Euclidean geometry of SPD manifolds. However, most existing metric tensors are fixed, which might lead to sub-optimal performance for SPD matrix learning, especially for deep SPD neural networks. To remedy this limitation, we leverage the commonly encountered pullback techniques and propose Adaptive Log-Euclidean Metrics (ALEMs), which extend the widely used Log-Euclidean Metric (LEM). Compared with the previous Riemannian metrics, our metrics contain learnable parameters, which can better adapt to the complex dynamics of Riemannian neural networks with minor extra computations. We also present a complete theoretical analysis to support our ALEMs, including algebraic and Riemannian properties. The experimental and theoretical results demonstrate the merit of the proposed metrics in improving the performance of SPD neural networks. The efficacy of our metrics is further showcased on a set of recently developed Riemannian building blocks, including Riemannian batch normalization, Riemannian Residual blocks, and Riemannian classifiers.
comment: Accepted by TIP 2024
♻ ☆ Wasserstein Gradient Boosting: A Framework for Distribution-Valued Supervised Learning
Gradient boosting is a sequential ensemble method that fits a new weaker learner to pseudo residuals at each iteration. We propose Wasserstein gradient boosting, a novel extension of gradient boosting that fits a new weak learner to alternative pseudo residuals that are Wasserstein gradients of loss functionals of probability distributions assigned at each input. It solves distribution-valued supervised learning, where the output values of the training dataset are probability distributions for each input. In classification and regression, a model typically returns, for each input, a point estimate of a parameter of a noise distribution specified for a response variable, such as the class probability parameter of a categorical distribution specified for a response label. A main application of Wasserstein gradient boosting in this paper is tree-based evidential learning, which returns a distributional estimate of the response parameter for each input. We empirically demonstrate the superior performance of the probabilistic prediction by Wasserstein gradient boosting in comparison with existing uncertainty quantification methods.
♻ ☆ GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Key-value (KV) caching has become the de-facto to accelerate generation speed for large language models (LLMs) inference. However, the growing cache demand with increasing sequence length has transformed LLM inference to be a memory bound problem, significantly constraining the system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors to represent the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first applies quantization to majority of entries of similar magnitudes to ultra-low precision. It then employs a low rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating three techniques, GEAR is able to fully exploit their synergistic potentials. Our experiments demonstrate that compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak-memory size up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
♻ ☆ Misam: Using ML in Dataflow Selection of Sparse-Sparse Matrix Multiplication ISCA 2024
Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in numerous fields, including scientific computing, graph analytics, and deep learning. These applications exploit the sparsity of matrices to reduce storage and computational demands. However, the irregular structure of sparse matrices poses significant challenges for performance optimization. Traditional hardware accelerators are tailored for specific sparsity patterns with fixed dataflow schemes - inner, outer, and row-wise but often perform suboptimally when the actual sparsity deviates from these predetermined patterns. As the use of SpGEMM expands across various domains, each with distinct sparsity characteristics, the demand for hardware accelerators that can efficiently handle a range of sparsity patterns is increasing. This paper presents a machine learning based approach for adaptively selecting the most appropriate dataflow scheme for SpGEMM tasks with diverse sparsity patterns. By employing decision trees and deep reinforcement learning, we explore the potential of these techniques to surpass heuristic-based methods in identifying optimal dataflow schemes. We evaluate our models by comparing their performance with that of a heuristic, highlighting the strengths and weaknesses of each approach. Our findings suggest that using machine learning for dynamic dataflow selection in hardware accelerators can provide upto 28 times gains.
comment: Accepted to ISCA 2024 MLArchSys workshop https://openreview.net/forum?id=A1V9FaZRbV
♻ ☆ Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian Process Models
Latent Gaussian process (GP) models are flexible probabilistic non-parametric function models. Vecchia approximations are accurate approximations for GPs to overcome computational bottlenecks for large data, and the Laplace approximation is a fast method with asymptotic convergence guarantees to approximate marginal likelihoods and posterior predictive distributions for non-Gaussian likelihoods. Unfortunately, the computational complexity of combined Vecchia-Laplace approximations grows faster than linearly in the sample size when used in combination with direct solver methods such as the Cholesky decomposition. Computations with Vecchia-Laplace approximations can thus become prohibitively slow precisely when the approximations are usually the most accurate, i.e., on large data sets. In this article, we present iterative methods to overcome this drawback. Among other things, we introduce and analyze several preconditioners, derive new convergence results, and propose novel methods for accurately approximating predictive variances. We analyze our proposed methods theoretically and in experiments with simulated and real-world data. In particular, we obtain a speed-up of an order of magnitude compared to Cholesky-based calculations and a threefold increase in prediction accuracy in terms of the continuous ranked probability score compared to a state-of-the-art method on a large satellite data set. All methods are implemented in a free C++ software library with high-level Python and R packages.
♻ ☆ Methods for Recovering Conditional Independence Graphs: A Survey
Conditional Independence (CI) graphs are a type of probabilistic graphical models that are primarily used to gain insights about feature relationships. Each edge represents the partial correlation between the connected features which gives information about their direct dependence. In this survey, we list out different methods and study the advances in techniques developed to recover CI graphs. We cover traditional optimization methods as well as recently developed deep learning architectures along with their recommended implementations. To facilitate wider adoption, we include preliminaries that consolidate associated operations, for example techniques to obtain covariance matrix for mixed datatypes.
♻ ☆ Post-processing fairness with minimal changes
In this paper, we introduce a novel post-processing algorithm that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce minimal changes between biased and debiased predictions; a property that, while highly desirable, is rarely prioritized as an explicit objective in fairness literature. Our approach leverages a multiplicative factor applied to the logit value of probability scores produced by a black-box classifier. We demonstrate the efficacy of our method through empirical evaluations, comparing its performance against other four debiasing algorithms on two widely used datasets in fairness research.
♻ ☆ Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
Visual storytelling consists in generating a natural language story given a temporally ordered sequence of images. This task is not only challenging for models, but also very difficult to evaluate with automatic metrics since there is no consensus about what makes a story 'good'. In this paper, we introduce a novel method that measures story quality in terms of human likeness regarding three key aspects highlighted in previous work: visual grounding, coherence, and repetitiveness. We then use this method to evaluate the stories generated by several models, showing that the foundation model LLaVA obtains the best result, but only slightly so compared to TAPM, a 50-times smaller visual storytelling model. Upgrading the visual and language components of TAPM results in a model that yields competitive performance with a relatively low number of parameters. Finally, we carry out a human evaluation study, whose results suggest that a 'good' story may require more than a human-like level of visual grounding, coherence, and repetition.
♻ ☆ Gameplay Filters: Robust Zero-Shot Safety through Adversarial Imagination
Despite the impressive recent advances in learning-based robot control, ensuring robustness to out-of-distribution conditions remains an open challenge. Safety filters can, in principle, keep arbitrary control policies from incurring catastrophic failures by overriding unsafe actions, but existing solutions for complex (e.g., legged) robot dynamics do not span the full motion envelope and instead rely on local, reduced-order models. These filters tend to overly restrict agility and can still fail when perturbed away from nominal conditions. This paper presents the gameplay filter, a new class of predictive safety filter that continually plays out hypothetical matches between its simulation-trained safety strategy and a virtual adversary co-trained to invoke worst-case events and sim-to-real error, and precludes actions that would cause it to fail down the line. We demonstrate the scalability and robustness of the approach with a first-of-its-kind full-order safety filter for (36-D) quadrupedal dynamics. Physical experiments on two different quadruped platforms demonstrate the superior zero-shot effectiveness of the gameplay filter under large perturbations such as tugging and unmodeled terrain.
♻ ☆ Generalization of Hamiltonian algorithms
The paper proves generalization results for a class of stochastic learning algorithms. The method applies whenever the algorithm generates an absolutely continuous distribution relative to some a-priori measure and the Radon Nikodym derivative has subgaussian concentration. Applications are bounds for the Gibbs algorithm and randomizations of stable deterministic algorithms as well as PAC-Bayesian bounds with data-dependent priors.
♻ ☆ Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent Codes
Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g. basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they encounter difficulties in achieving an optimal balance between sampling fidelity and diversity. We address this challenge by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a discrete latent space to tackle the issue of posterior collapse. Specifically, we introduce an instance-based codebook that allows tailored latent representations for each example. In a nutshell, the rows of the codebook are dynamically adjusted to reflect contextual information (i.e., past motion patterns extracted from the observed trajectories). In this way, the discretization process gains flexibility, leading to improved reconstructions. Notably, instance-level dynamics are injected into the codebook through low-rank updates, which restrict the customization of the codebook to a lower dimension space. The resulting discrete space serves as the basis of the subsequent step, which regards the training of a diffusion-based predictive model. We show that such a two-fold framework, augmented with instance-level discretization, leads to accurate and diverse forecasts, yielding state-of-the-art performance on three established benchmarks.
comment: 15 pages, 3 figures, 5 tables
♻ ☆ Verification of Geometric Robustness of Neural Networks via Piecewise Linear Approximation and Lipschitz Optimisation ECAI 2024
We address the problem of verifying neural networks against geometric transformations of the input image, including rotation, scaling, shearing, and translation. The proposed method computes provably sound piecewise linear constraints for the pixel values by using sampling and linear approximations in combination with branch-and-bound Lipschitz optimisation. The method obtains provably tighter over-approximations of the perturbation region than the present state-of-the-art. We report results from experiments on a comprehensive set of verification benchmarks on MNIST and CIFAR10. We show that our proposed implementation resolves up to 32% more verification cases than present approaches.
comment: ECAI 2024
♻ ☆ On the Efficacy of Text-Based Input Modalities for Action Anticipation
Anticipating future actions is a highly challenging task due to the diversity and scale of potential future actions; yet, information from different modalities help narrow down plausible action choices. Each modality can provide diverse and often complementary context for the model to learn from. While previous multi-modal methods leverage information from modalities such as video and audio, we primarily explore how text descriptions of actions and objects can also lead to more accurate action anticipation by providing additional contextual cues, e.g., about the environment and its contents. We propose a Multi-modal Contrastive Anticipative Transformer (M-CAT), a video transformer architecture that jointly learns from multi-modal features and text descriptions of actions and objects. We train our model in two stages, where the model first learns to align video clips with descriptions of future actions, and is subsequently fine-tuned to predict future actions. Compared to existing methods, M-CAT has the advantage of learning additional context from two types of text inputs: rich descriptions of future actions during pre-training, and, text descriptions for detected objects and actions during modality feature fusion. Through extensive experimental evaluation, we demonstrate that our model outperforms previous methods on the EpicKitchens datasets, and show that using simple text descriptions of actions and objects aid in more effective action anticipation. In addition, we examine the impact of object and action information obtained via text, and perform extensive ablations.
♻ ☆ Standardized Interpretable Fairness Measures for Continuous Risk Scores
We propose a standardized version of fairness measures for continuous scores with a reasonable interpretation based on the Wasserstein distance. Our measures are easily computable and well suited for quantifying and interpreting the strength of group disparities as well as for comparing biases across different models, datasets, or time points. We derive a link between the different families of existing fairness measures for scores and show that the proposed standardized fairness measures outperform ROC-based fairness measures because they are more explicit and can quantify significant biases that ROC-based fairness measures miss.
♻ ☆ Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration
Recent neural models of code, such as OpenAI Codex and AlphaCode, have demonstrated remarkable proficiency at code generation due to the underlying attention mechanism. However, it often remains unclear how the models actually process code, and to what extent their reasoning and the way their attention mechanism scans the code matches the patterns of developers. A poor understanding of the model reasoning process limits the way in which current neural models are leveraged today, so far mostly for their raw prediction. To fill this gap, this work studies how the processed attention signal of three open large language models - CodeGen, InCoder and GPT-J - agrees with how developers look at and explore code when each answers the same sensemaking questions about code. Furthermore, we contribute an open-source eye-tracking dataset comprising 92 manually-labeled sessions from 25 developers engaged in sensemaking tasks. We empirically evaluate five heuristics that do not use the attention and ten attention-based post-processing approaches of the attention signal of CodeGen against our ground truth of developers exploring code, including the novel concept of follow-up attention which exhibits the highest agreement between model and human attention. Our follow-up attention method can predict the next line a developer will look at with 47% accuracy. This outperforms the baseline prediction accuracy of 42.3%, which uses the session history of other developers to recommend the next line. These results demonstrate the potential of leveraging the attention signal of pre-trained models for effective code exploration.
comment: Published at IEEE Transactions on Software Engineering
♻ ☆ Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods
We study stochastic Cubic Newton methods for solving general possibly non-convex minimization problems. We propose a new framework, which we call the helper framework, that provides a unified view of the stochastic and variance-reduced second-order algorithms equipped with global complexity guarantees. It can also be applied to learning with auxiliary information. Our helper framework offers the algorithm designer high flexibility for constructing and analyzing the stochastic Cubic Newton methods, allowing arbitrary size batches, and the use of noisy and possibly biased estimates of the gradients and Hessians, incorporating both the variance reduction and the lazy Hessian updates. We recover the best-known complexities for the stochastic and variance-reduced Cubic Newton, under weak assumptions on the noise. A direct consequence of our theory is the new lazy stochastic second-order method, which significantly improves the arithmetic complexity for large dimension problems. We also establish complexity bounds for the classes of gradient-dominated objectives, that include convex and strongly convex problems. For Auxiliary Learning, we show that using a helper (auxiliary function) can outperform training alone if a given similarity measure is small.
♻ ☆ No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery
What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula enable agents to be robust to in- and out-of-distribution tasks. We ask to what extent these methods are themselves robust when applied to a novel setting, closely inspired by a real-world robotics problem. Surprisingly, we find that the state-of-the-art UED methods either do not improve upon the na\"{i}ve baseline of Domain Randomisation (DR), or require substantial hyperparameter tuning to do so. Our analysis shows that this is due to their underlying scoring functions failing to predict intuitive measures of ``learnability'', i.e., in finding the settings that the agent sometimes solves, but not always. Based on this, we instead directly train on levels with high learnability and find that this simple and intuitive approach outperforms UED methods and DR in several binary-outcome environments, including on our domain and the standard UED domain of Minigrid. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability.
♻ ☆ Innovative Speech-Based Deep Learning Approaches for Parkinson's Disease Classification: A Systematic Review
Parkinson's disease (PD), the second most prevalent neurodegenerative disorder worldwide, frequently presents with early-stage speech impairments. Recent advancements in Artificial Intelligence (AI), particularly deep learning (DL), have significantly enhanced PD diagnosis through the analysis of speech data. Nevertheless, the progress of research is restricted by the limited availability of publicly accessible speech-based PD datasets, primarily due to privacy concerns. The goal of this systematic review is to explore the current landscape of speech-based DL approaches for PD classification, based on 33 scientific works published between 2020 and March 2024. We discuss their available resources, capabilities, potential limitations, and issues related to bias, explainability, and privacy. Furthermore, this review provides an overview of publicly accessible speech-based datasets and open-source material for PD. The DL approaches are categorized into end-to-end (E2E) learning, transfer learning (TL) and deep acoustic features extraction (DAFE) approaches. Among E2E approaches, Convolutional Neural Networks (CNNs) are prevalent, though Transformers are increasingly popular. E2E approaches face challenges such as limited data and computational resources, especially with Transformers. TL addresses these issues by providing more robust PD diagnosis and better generalizability across languages. DAFE aims to improve the explainability and interpretability of results by examining the specific effects of deep features on both other DL approaches and more traditional machine learning (ML) methods. However, it often underperforms compared to E2E and TL approaches.
comment: Submitted in Applied Sciences - peer reviewed Open Access journal. This research was funded by the NWO research programme AiNed Fellowship Grants under the project Responsible AI for Voice Diagnostics (RAIVD) - grant number NGF.1607.22.013
♻ ☆ GANs Conditioning Methods: A Survey
In recent years, Generative Adversarial Networks (GANs) have seen significant advancements, leading to their widespread adoption across various fields. The original GAN architecture enables the generation of images without any specific control over the content, making it an unconditional generation process. However, many practical applications require precise control over the generated output, which has led to the development of conditional GANs (cGANs) that incorporate explicit conditioning to guide the generation process. cGANs extend the original framework by incorporating additional information (conditions), enabling the generation of samples that adhere to that specific criteria. Various conditioning methods have been proposed, each differing in how they integrate the conditioning information into both the generator and the discriminator networks. In this work, we review the conditioning methods proposed for GANs, exploring the characteristics of each method and highlighting their unique mechanisms and theoretical foundations. Furthermore, we conduct a comparative analysis of these methods, evaluating their performance on various image datasets. Through these analyses, we aim to provide insights into the strengths and limitations of various conditioning techniques, guiding future research and application in generative modeling.
♻ ☆ Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of Peptides
Molecular Dynamics (MD) simulations are irreplaceable and ubiquitous in fields of materials science, chemistry, pharmacology just to name a few. Conventional MD simulations are plagued by numerical stability as well as long equilibration time issues, which limits broader applications of MD simulations. Recently, a surge of deep learning approaches have been devised for time-coarsened dynamics, which learns the state transition mechanism over much larger time scales to overcome these limitations. However, only a few methods target the underlying Boltzmann distribution by resampling techniques, where proposals are rarely accepted as new states with low efficiency. In this work, we propose a force-guided bridge matching model, FBM, a novel framework that first incorporates physical priors into bridge matching for full-atom time-coarsened dynamics. With the guidance of our well-designed intermediate force field, FBM is feasible to target the Boltzmann-like distribution by direct inference without extra steps. Experiments on small peptides verify our superiority in terms of comprehensive metrics and demonstrate transferability to unseen peptide systems.
♻ ☆ FRRI: a novel algorithm for fuzzy-rough rule induction
Interpretability is the next frontier in machine learning research. In the search for white box models - as opposed to black box models, like random forests or neural networks - rule induction algorithms are a logical and promising option, since the rules can easily be understood by humans. Fuzzy and rough set theory have been successfully applied to this archetype, almost always separately. As both approaches to rule induction involve granular computing based on the concept of equivalence classes, it is natural to combine them. The QuickRules\cite{JensenCornelis2009} algorithm was a first attempt at using fuzzy rough set theory for rule induction. It is based on QuickReduct, a greedy algorithm for building decision reducts. QuickRules already showed an improvement over other rule induction methods. However, to evaluate the full potential of a fuzzy rough rule induction algorithm, one needs to start from the foundations. In this paper, we introduce a novel rule induction algorithm called Fuzzy Rough Rule Induction (FRRI). We provide background and explain the workings of our algorithm. Furthermore, we perform a computational experiment to evaluate the performance of our algorithm and compare it to other state-of-the-art rule induction approaches. We find that our algorithm is more accurate while creating small rulesets consisting of relatively short rules. We end the paper by outlining some directions for future work.
♻ ☆ A Guide to Feature Importance Methods for Scientific Inference
While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide due to their opaque internal mechanisms. Feature importance (FI) methods provide useful insights into the DGP under certain conditions. Since the results of different FI methods have different interpretations, selecting the correct FI method for a concrete use case is crucial and still requires expert knowledge. This paper serves as a comprehensive guide to help understand the different interpretations of global FI methods. Through an extensive review of FI methods and providing new proofs regarding their interpretation, we facilitate a thorough understanding of these methods and formulate concrete recommendations for scientific inference. We conclude by discussing options for FI uncertainty estimation and point to directions for future research aiming at full statistical inference from black-box ML models.
♻ ☆ Advances and Open Challenges in Federated Foundation Models
The integration of Foundation Models (FMs) with Federated Learning (FL) presents a transformative paradigm in Artificial Intelligence (AI). This integration offers enhanced capabilities while addressing concerns of privacy, data decentralization, and computational efficiency. This paper provides a comprehensive survey of the emerging field of Federated Foundation Models (FedFM), elucidating their synergistic relationship and exploring novel methodologies, challenges, and future directions that the FL research field needs to focus on in order to thrive in the age of FMs. A systematic multi-tiered taxonomy is proposed, categorizing existing FedFM approaches for model training, aggregation, trustworthiness, and incentivization. Key challenges, including how to enable FL to deal with high complexity of computational demands, privacy considerations, contribution evaluation, and communication efficiency, are thoroughly discussed. Moreover, the paper explores the intricate challenges of communication, scalability, and security inherent in training/fine-tuning FMs via FL. It highlights the potential of quantum computing to revolutionize the processes of training, inference, optimization, and data encryption. This survey also introduces the implementation requirement of FedFM and some practical FedFM applications. Then, this survey provides the lessons with a clear understanding of our findings for FedFM. Finally, this survey not only provides insights into the current state and challenges of FedFM but also paves the way for future research directions, emphasizing the need for developing trustworthy solutions. It serves as a foundational guide for researchers and practitioners interested in contributing to this interdisciplinary and rapidly advancing field.
comment: Survey of Federated Foundation Models (FedFM)
♻ ☆ Next Level Message-Passing with Hierarchical Support Graphs
Message-Passing Neural Networks (MPNNs) are extensively employed in graph learning tasks but suffer from limitations such as the restricted scope of information exchange, by being confined to neighboring nodes during each round of message passing. Various strategies have been proposed to address these limitations, including incorporating virtual nodes to facilitate global information exchange. In this study, we introduce the Hierarchical Support Graph (HSG), an extension of the virtual node concept created through recursive coarsening of the original graph. This approach provides a flexible framework for enhancing information flow in graphs, independent of the specific MPNN layers utilized. We present a theoretical analysis of HSGs, investigate their empirical performance, and demonstrate that HSGs can surpass other methods augmented with virtual nodes, achieving state-of-the-art results across multiple datasets.
♻ ☆ A comparison between humans and AI at recognizing objects in unusual poses
Deep learning is closing the gap with human vision on several object recognition benchmarks. Here we investigate this gap for challenging images where objects are seen in unusual poses. We find that humans excel at recognizing objects in such poses. In contrast, state-of-the-art deep networks for vision (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) and state-of-the-art large vision-language models (Claude 3.5, Gemini 1.5, GPT-4) are systematically brittle on unusual poses, with the exception of Gemini showing excellent robustness in that condition. As we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) are necessary to identify objects in unusual poses. An analysis of error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feed-forward deep networks. In conclusion, our comparison reveals that humans and deep networks rely on different mechanisms for recognizing objects in unusual poses. Understanding the nature of the mental processes taking place during extra viewing time may be key to reproduce the robustness of human vision in silico.
♻ ☆ Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic
Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either a frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful.
♻ ☆ The $μ\mathcal{G}$ Language for Programming Graph Neural Networks
Graph neural networks form a class of deep learning architectures specifically designed to work with graph-structured data. As such, they share the inherent limitations and problems of deep learning, especially regarding the issues of explainability and trustworthiness. We propose $\mu\mathcal{G}$, an original domain-specific language for the specification of graph neural networks that aims to overcome these issues. The language's syntax is introduced, and its meaning is rigorously defined by a denotational semantics. An equivalent characterization in the form of an operational semantics is also provided and, together with a type system, is used to prove the type soundness of $\mu\mathcal{G}$. We show how $\mu\mathcal{G}$ programs can be represented in a more user-friendly graphical visualization, and provide examples of its generality by showing how it can be used to define some of the most popular graph neural network models, or to develop any custom graph processing application.
♻ ☆ Evaluating the Predictive Features of Person-Centric Knowledge Graph Embeddings: Unfolding Ablation Studies
Developing novel predictive models with complex biomedical information is challenging due to various idiosyncrasies related to heterogeneity, standardization or sparseness of the data. We previously introduced a person-centric ontology to organize information about individual patients, and a representation learning framework to extract person-centric knowledge graphs (PKGs) and to train Graph Neural Networks (GNNs). In this paper, we propose a systematic approach to examine the results of GNN models trained with both structured and unstructured information from the MIMIC-III dataset. Through ablation studies on different clinical, demographic, and social data, we show the robustness of this approach in identifying predictive features in PKGs for the task of readmission prediction.
comment: Published in the 34th Medical Informatics Europe Conference
♻ ☆ Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling? INTERSPEECH
Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Frechet Audio Distance. In contrast, we aim to evaluate the quality of audio generation by examining the effectiveness of using them as training data. Specifically, we conduct studies to explore the use of synthetic audio for audio recognition. Moreover, we investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling. Our comprehensive experiments demonstrate the potential of using synthetic audio for audio recognition and speech-related modeling. Our code is available at https://github.com/usc-sail/SynthAudio.
comment: Accepted to 2024 INTERSPEECH; corrections to ActivityNet labels
♻ ☆ Efficient Topology-aware Data Augmentation for High-Degree Graph Neural Networks KDD 2024
In recent years, graph neural networks (GNNs) have emerged as a potent tool for learning on graph-structured data and won fruitful successes in varied fields. The majority of GNNs follow the message-passing paradigm, where representations of each node are learned by recursively aggregating features of its neighbors. However, this mechanism brings severe over-smoothing and efficiency issues over high-degree graphs (HDGs), wherein most nodes have dozens (or even hundreds) of neighbors, such as social networks, transaction graphs, power grids, etc. Additionally, such graphs usually encompass rich and complex structure semantics, which are hard to capture merely by feature aggregations in GNNs. Motivated by the above limitations, we propose TADA, an efficient and effective front-mounted data augmentation framework for GNNs on HDGs. Under the hood, TADA includes two key modules: (i) feature expansion with structure embeddings, and (ii) topology- and attribute-aware graph sparsification. The former obtains augmented node features and enhanced model capacity by encoding the graph structure into high-quality structure embeddings with our highly-efficient sketching method. Further, by exploiting task-relevant features extracted from graph structures and attributes, the second module enables the accurate identification and reduction of numerous redundant/noisy edges from the input graph, thereby alleviating over-smoothing and facilitating faster feature aggregations over HDGs. Empirically, TADA considerably improves the predictive performance of mainstream GNN models on 8 real homophilic/heterophilic HDGs in terms of node classification, while achieving efficient training and inference processes.
comment: This is the technical report for the paper accepted to KDD 2024. 16 pages
♻ ☆ PsychoGAT: A Novel Psychological Measurement Paradigm through Interactive Fiction Games with LLM Agents ACL 2024
Psychological measurement is essential for mental health, self-understanding, and personal development. Traditional methods, such as self-report scales and psychologist interviews, often face challenges with engagement and accessibility. While game-based and LLM-based tools have been explored to improve user interest and automate assessment, they struggle to balance engagement with generalizability. In this work, we propose PsychoGAT (Psychological Game AgenTs) to achieve a generic gamification of psychological assessment. The main insight is that powerful LLMs can function both as adept psychologists and innovative game designers. By incorporating LLM agents into designated roles and carefully managing their interactions, PsychoGAT can transform any standardized scales into personalized and engaging interactive fiction games. To validate the proposed method, we conduct psychometric evaluations to assess its effectiveness and employ human evaluators to examine the generated content across various psychological constructs, including depression, cognitive distortions, and personality traits. Results demonstrate that PsychoGAT serves as an effective assessment tool, achieving statistically significant excellence in psychometric metrics such as reliability, convergent validity, and discriminant validity. Moreover, human evaluations confirm PsychoGAT's enhancements in content coherence, interactivity, interest, immersion, and satisfaction.
comment: ACL 2024
♻ ☆ Neighborhood and Global Perturbations Supported SAM in Federated Learning: From Local Tweaks To Global Awareness
Federated Learning (FL) can be coordinated under the orchestration of a central server to collaboratively build a privacy-preserving model without the need for data exchange. However, participant data heterogeneity leads to local optima divergence, subsequently affecting convergence outcomes. Recent research has focused on global sharpness-aware minimization (SAM) and dynamic regularization techniques to enhance consistency between global and local generalization and optimization objectives. Nonetheless, the estimation of global SAM introduces additional computational and memory overhead, while dynamic regularization suffers from bias in the local and global dual variables due to training isolation. In this paper, we propose a novel FL algorithm, FedTOGA, designed to consider optimization and generalization objectives while maintaining minimal uplink communication overhead. By linking local perturbations to global updates, global generalization consistency is improved. Additionally, global updates are used to correct local dynamic regularizers, reducing dual variables bias and enhancing optimization consistency. Global updates are passively received by clients, reducing overhead. We also propose neighborhood perturbation to approximate local perturbation, analyzing its strengths and limitations. Theoretical analysis shows FedTOGA achieves faster convergence $O(1/T)$ under non-convex functions. Empirical studies demonstrate that FedTOGA outperforms state-of-the-art algorithms, with a 1\% accuracy increase and 30\% faster convergence, achieving state-of-the-art.
♻ ☆ Uncertainty-based Fairness Measures
Unfair predictions of machine learning (ML) models impede their broad acceptance in real-world settings. Tackling this arduous challenge first necessitates defining what it means for an ML model to be fair. This has been addressed by the ML community with various measures of fairness that depend on the prediction outcomes of the ML models, either at the group level or the individual level. These fairness measures are limited in that they utilize point predictions, neglecting their variances, or uncertainties, making them susceptible to noise, missingness and shifts in data. In this paper, we first show that an ML model may appear to be fair with existing point-based fairness measures but biased against a demographic group in terms of prediction uncertainties. Then, we introduce new fairness measures based on different types of uncertainties, namely, aleatoric uncertainty and epistemic uncertainty. We demonstrate on many datasets that (i) our uncertainty-based measures are complementary to existing measures of fairness, and (ii) they provide more insights about the underlying issues leading to bias.
♻ ☆ LaMAGIC: Language-Model-based Topology Generation for Analog Integrated Circuits
In the realm of electronic and electrical engineering, automation of analog circuit is increasingly vital given the complexity and customized requirements of modern applications. However, existing methods only develop search-based algorithms that require many simulation iterations to design a custom circuit topology, which is usually a time-consuming process. To this end, we introduce LaMAGIC, a pioneering language model-based topology generation model that leverages supervised finetuning for automated analog circuit design. LaMAGIC can efficiently generate an optimized circuit design from the custom specification in a single pass. Our approach involves a meticulous development and analysis of various input and output formulations for circuit. These formulations can ensure canonical representations of circuits and align with the autoregressive nature of LMs to effectively addressing the challenges of representing analog circuits as graphs. The experimental results show that LaMAGIC achieves a success rate of up to 96\% under a strict tolerance of 0.01. We also examine the scalability and adaptability of LaMAGIC, specifically testing its performance on more complex circuits. Our findings reveal the enhanced effectiveness of our adjacency matrix-based circuit formulation with floating-point input, suggesting its suitability for handling intricate circuit designs. This research not only demonstrates the potential of language models in graph generation, but also builds a foundational framework for future explorations in automated analog circuit design.
comment: Proceedings of the 41st International Conference on Machine Learning, PMLR 235:6253-6262 https://proceedings.mlr.press/v235/chang24c.html
♻ ☆ 1 From the Pursuit of Universal AGI Architecture to Systematic Approach to Heterogenous AGI: Addressing Alignment, Energy, & AGI Grand Challenges
AI faces a trifecta of grand challenges: the Energy Wall, the Alignment Problem and the Leap from Narrow AI to AGI. Contemporary AI solutions consume unsustainable amounts of energy during model training and daily operations. Making things worse, the amount of computation required to train each new AI model has been doubling every 2 months since 2020, directly translating to unprecedented increases in energy consumption. The leap from AI to AGI requires multiple functional subsystems operating in a balanced manner, which requires a system architecture. However, the current approach to artificial intelligence lacks system design; even though system characteristics play a key role in the human brain; from the way it processes information to how it makes decisions. System design is the key to alignment, one of the most challenging goals in AI. This difficulty stems from the fact that the complexity of human moral system requires a similarly sophisticated system for alignment. Without accurately reflecting the complexity of these core moral subsystems and systems, aligning AI with human values becomes significantly more challenging. In this paper, we posit that system design is the missing piece in overcoming the grand challenges. We present a Systematic Approach to AGI that utilizes system design principles to AGI, while providing ways to overcome the energy wall and the alignment challenges. This paper asserts that artificial intelligence can be realized through a multiplicity of design-specific pathways, rather than a singular, overarching AGI architecture. AGI systems may exhibit diverse architectural configurations and capabilities, contingent upon their intended use cases. It advocates for a focus on employing system design principles as a guiding framework, rather than solely concentrating on a universal AGI architecture.
comment: International Journal on Semantic Computing (2024) Categories: Artificial Intelligence; AI; Artificial General Intelligence; AGI; System Design; System Architecture
♻ ☆ Erasing Concepts from Text-to-Image Diffusion Models with Few-shot Unlearning BMVC2024
Generating images from text has become easier because of the scaling of diffusion models and advancements in the field of vision and language. These models are trained using vast amounts of data from the Internet. Hence, they often contain undesirable content such as copyrighted material. As it is challenging to remove such data and retrain the models, methods for erasing specific concepts from pre-trained models have been investigated. We propose a novel concept-erasure method that updates the text encoder using few-shot unlearning in which a few real images are used. The discussion regarding the generated images after erasing a concept has been lacking. While there are methods for specifying the transition destination for concepts, the validity of the specified concepts is unclear. Our method implicitly achieves this by transitioning to the latent concepts inherent in the model or the images. Our method can erase a concept within 10 s, making concept erasure more accessible than ever before. Implicitly transitioning to related concepts leads to more natural concept erasure. We applied the proposed method to various concepts and confirmed that concept erasure can be achieved tens to hundreds of times faster than with current methods. By varying the parameters to be updated, we obtained results suggesting that, like previous research, knowledge is primarily accumulated in the feed-forward networks of the text encoder. Our code is available at \url{https://github.com/fmp453/few-shot-erasing}
comment: 25 pages, 28 figures, accepted by BMVC2024
♻ ☆ A Best-of-Both-Worlds Algorithm for Constrained MDPs with Long-Term Constraints
We study online learning in episodic constrained Markov decision processes (CMDPs), where the learner aims at collecting as much reward as possible over the episodes, while satisfying some long-term constraints during the learning process. Rewards and constraints can be selected either stochastically or adversarially, and the transition function is not known to the learner. While online learning in classical (unconstrained) MDPs has received considerable attention over the last years, the setting of CMDPs is still largely unexplored. This is surprising, since in real-world applications, such as, e.g., autonomous driving, automated bidding, and recommender systems, there are usually additional constraints and specifications that an agent has to obey during the learning process. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with long-term constraints, in the flavor of Balseiro et al. (2023). Our algorithm is capable of handling settings in which rewards and constraints are selected either stochastically or adversarially, without requiring any knowledge of the underling process. Moreover, our algorithm matches state-of-the-art regret and constraint violation bounds for settings in which constraints are selected stochastically, while it is the first to provide guarantees in the case in which they are chosen adversarially.
♻ ☆ Category-Theoretical and Topos-Theoretical Frameworks in Machine Learning: A Survey
In this survey, we provide an overview of category theory-derived machine learning from four mainstream perspectives: gradient-based learning, probability-based learning, invariance and equivalence-based learning, and topos-based learning. For the first three topics, we primarily review research in the past five years, updating and expanding on the previous survey by Shiebler et al.. The fourth topic, which delves into higher category theory, particularly topos theory, is surveyed for the first time in this paper. In certain machine learning methods, the compositionality of functors plays a vital role, prompting the development of specific categorical frameworks. However, when considering how the global properties of a network reflect in local structures and how geometric properties are expressed with logic, the topos structure becomes particularly significant and profound.
♻ ☆ CAST: Cluster-Aware Self-Training for Tabular Data via Reliable Confidence
Tabular data is one of the most widely used data modalities, encompassing numerous datasets with substantial amounts of unlabeled data. Despite this prevalence, there is a notable lack of simple and versatile methods for utilizing unlabeled data in the tabular domain, where both gradient-boosting decision trees and neural networks are employed. In this context, self-training has gained attraction due to its simplicity and versatility, yet it is vulnerable to noisy pseudo-labels caused by erroneous confidence. Several solutions have been proposed to handle this problem, but they often compromise the inherent advantages of self-training, resulting in limited applicability in the tabular domain. To address this issue, we explore a novel direction of reliable confidence in self-training contexts and conclude that self-training can be improved by making that the confidence, which represents the value of the pseudo-label, aligns with the cluster assumption. In this regard, we propose Cluster-Aware Self-Training (CAST) for tabular data, which enhances existing self-training algorithms at a negligible cost while maintaining simplicity and versatility. Concretely, CAST calibrates confidence by regularizing the classifier's confidence based on local density for each class in the labeled training data, resulting in lower confidence for pseudo-labels in low-density regions. Extensive empirical evaluations on up to 21 real-world datasets confirm not only the superior performance of CAST but also its robustness in various setups in self-training contexts.
comment: 11 pages for main body, and 10 additional pages for appendix
♻ ☆ Mitigating Label Noise on Graph via Topological Sample Selection ICML 2024
Despite the success of the carefully-annotated benchmarks, the effectiveness of existing graph neural networks (GNNs) can be considerably impaired in practice when the real-world graph data is noisily labeled. Previous explorations in sample selection have been demonstrated as an effective way for robust learning with noisy labels, however, the conventional studies focus on i.i.d data, and when moving to non-iid graph data and GNNs, two notable challenges remain: (1) nodes located near topological class boundaries are very informative for classification but cannot be successfully distinguished by the heuristic sample selection. (2) there is no available measure that considers the graph topological information to promote sample selection in a graph. To address this dilemma, we propose a $\textit{Topological Sample Selection}$ (TSS) method that boosts the informative sample selection process in a graph by utilising topological information. We theoretically prove that our procedure minimizes an upper bound of the expected risk under target clean distribution, and experimentally show the superiority of our method compared with state-of-the-art baselines.
comment: ICML 2024
♻ ☆ Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation for Global Solar Mapping
The transition to renewable energy, particularly solar, is key to mitigating climate change. Google's Solar API aids this transition by estimating solar potential from aerial imagery, but its impact is constrained by geographical coverage. This paper proposes expanding the API's reach using satellite imagery, enabling global solar potential assessment. We tackle challenges involved in building a Digital Surface Model (DSM) and roof instance segmentation from lower resolution and single oblique views using deep learning models. Our models, trained on aligned satellite and aerial datasets, produce 25cm DSMs and roof segments. With ~1m DSM MAE on buildings, ~5deg roof pitch error and ~56% IOU on roof segmentation, they significantly enhance the Solar API's potential to promote solar adoption.
comment: 14 pages
♻ ☆ DiffiT: Diffusion Vision Transformers for Image Generation ECCV'24
Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities and scalability, especially for recognition tasks. In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). Specifically, we propose a methodology for finegrained control of the denoising process and introduce the Time-dependant Multihead Self Attention (TMSA) mechanism. DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency. We also propose latent and image space DiffiT models and show SOTA performance on a variety of class-conditional and unconditional synthesis tasks at different resolutions. The Latent DiffiT model achieves a new SOTA FID score of 1.73 on ImageNet256 dataset while having 19.85%, 16.88% less parameters than other Transformer-based diffusion models such as MDT and DiT,respectively. Code: https://github.com/NVlabs/DiffiT
comment: Accepted to ECCV'24
♻ ☆ Learning from Heterogeneity: A Dynamic Learning Framework for Hypergraphs
Graph neural network (GNN) has gained increasing popularity in recent years owing to its capability and flexibility in modeling complex graph structure data. Among all graph learning methods, hypergraph learning is a technique for exploring the implicit higher-order correlations when training the embedding space of the graph. In this paper, we propose a hypergraph learning framework named LFH that is capable of dynamic hyperedge construction and attentive embedding update utilizing the heterogeneity attributes of the graph. Specifically, in our framework, the high-quality features are first generated by the pairwise fusion strategy that utilizes explicit graph structure information when generating initial node embedding. Afterwards, a hypergraph is constructed through the dynamic grouping of implicit hyperedges, followed by the type-specific hypergraph learning process. To evaluate the effectiveness of our proposed framework, we conduct comprehensive experiments on several popular datasets with eleven state-of-the-art models on both node classification and link prediction tasks, which fall into categories of homogeneous pairwise graph learning, heterogeneous pairwise graph learning, and hypergraph learning. The experiment results demonstrate a significant performance gain (average 12.5% in node classification and 13.3% in link prediction) compared with recent state-of-the-art methods.
♻ ☆ Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, making communication a larger portion of the overall training time. Consequently, optimizing communication for distributed training has become crucial. In this article, we briefly introduce the general architecture of distributed deep neural network training and analyze relationships among Parallelization Strategy, Collective Communication Library, and Network from the perspective of communication optimization, which forms a three-layer paradigm. We then review current representative research advances within this three-layer paradigm. We find that layers in the current three-layer paradigm are relatively independent and there is a rich design space for cross-layer collaborative optimization in distributed training scenarios. Therefore, we advocate "Vertical" and "Horizontal" co-designs which extend the three-layer paradigm to a five-layer paradigm. We also advocate "Intra-Inter" and "Host-Net" co-designs to further utilize the potential of heterogeneous resources. We hope this article can shed some light on future research on communication optimization for distributed training.
comment: 8 pages, 5 figures
♻ ☆ GenRec: Generative Sequential Recommendation with Large Language Models
Sequential recommendation is a task to capture hidden user preferences from historical user item interaction data and recommend next items for the user. Significant progress has been made in this domain by leveraging classification based learning methods. Inspired by the recent paradigm of 'pretrain, prompt and predict' in NLP, we consider sequential recommendation as a sequence to sequence generation task and propose a novel model named Generative Recommendation (GenRec). Unlike classification based models that learn explicit user and item representations, GenRec utilizes the sequence modeling capability of Transformer and adopts the masked item prediction objective to effectively learn the hidden bidirectional sequential patterns. Different from existing generative sequential recommendation models, GenRec does not rely on manually designed hard prompts. The input to GenRec is textual user item sequence and the output is top ranked next items. Moreover, GenRec is lightweight and requires only a few hours to train effectively in low-resource settings, making it highly applicable to real-world scenarios and helping to democratize large language models in the sequential recommendation domain. Our extensive experiments have demonstrated that GenRec generalizes on various public real-world datasets and achieves state-of-the-art results. Our experiments also validate the effectiveness of the the proposed masked item prediction objective that improves the model performance by a large margin.
♻ ☆ Enhancing Data-Limited Graph Neural Networks by Actively Distilling Knowledge from Large Language Models
Graphs are pervasive in the real-world, such as social network analysis, bioinformatics, and knowledge graphs. Graph neural networks (GNNs) have great ability in node classification, a fundamental task on graphs. Unfortunately, conventional GNNs still face challenges in scenarios with few labeled nodes, despite the prevalence of few-shot node classification tasks in real-world applications. To address this challenge, various approaches have been proposed, including graph meta-learning, transfer learning, and methods based on Large Language Models (LLMs). However, traditional meta-learning and transfer learning methods often require prior knowledge from base classes or fail to exploit the potential advantages of unlabeled nodes. Meanwhile, LLM-based methods may overlook the zero-shot capabilities of LLMs and rely heavily on the quality of generated contexts. In this paper, we propose a novel approach that integrates LLMs and GNNs, leveraging the zero-shot inference and reasoning capabilities of LLMs and employing a Graph-LLM-based active learning paradigm to enhance GNNs' performance. Extensive experiments demonstrate the effectiveness of our model in improving node classification accuracy with considerably limited labeled data, surpassing state-of-the-art baselines by significant margins.
comment: 10 pages, 3 Figures
♻ ☆ VFLIP: A Backdoor Defense for Vertical Federated Learning via Identification and Purification ESORICS 2024
Vertical Federated Learning (VFL) focuses on handling vertically partitioned data over FL participants. Recent studies have discovered a significant vulnerability in VFL to backdoor attacks which specifically target the distinct characteristics of VFL. Therefore, these attacks may neutralize existing defense mechanisms designed primarily for Horizontal Federated Learning (HFL) and deep neural networks. In this paper, we present the first backdoor defense, called VFLIP, specialized for VFL. VFLIP employs the identification and purification techniques that operate at the inference stage, consequently improving the robustness against backdoor attacks to a great extent. VFLIP first identifies backdoor-triggered embeddings by adopting a participant-wise anomaly detection approach. Subsequently, VFLIP conducts purification which removes the embeddings identified as malicious and reconstructs all the embeddings based on the remaining embeddings. We conduct extensive experiments on CIFAR10, CINIC10, Imagenette, NUS-WIDE, and BankMarketing to demonstrate that VFLIP can effectively mitigate backdoor attacks in VFL. https://github.com/blingcho/VFLIP-esorics24
comment: Accepted by 29th European Symposium on Research in Computer Security (ESORICS 2024)
♻ ☆ CityLight: A Universal Model for Coordinated Traffic Signal Control in City-scale Heterogeneous Intersections
The increasingly severe congestion problem in modern cities strengthens the significance of developing city-scale traffic signal control (TSC) methods for traffic efficiency enhancement. While reinforcement learning has been widely explored in TSC, most of them still target small-scale optimization and cannot directly scale to the city level due to unbearable resource demand. Only a few of them manage to tackle city-level optimization, namely a thousand-scale optimization, by incorporating parameter-sharing mechanisms, but hardly have they fully tackled the heterogeneity of intersections and intricate between-intersection interactions inherent in real-world city road networks. To fill in the gap, we target at the two important challenges in adopting parameter-sharing paradigms to solve TSC: inconsistency of inner state representations for intersections heterogeneous in configuration, scale, and orders of available traffic phases; intricacy of impacts from neighborhood intersections that have various relative traffic relationships due to inconsistent phase orders and diverse relative positioning. Our method, CityLight, features a universal representation module that not only aligns the state representations of intersections by reindexing their phases based on their semantics and designing heterogeneity-preserving observations, but also encodes the narrowed relative traffic relation types to project the neighborhood intersections onto a uniform relative traffic impact space. We further attentively fuse neighborhood representations based on their competing relations and incorporate neighborhood-integrated rewards to boost coordination. Extensive experiments with hundreds to tens of thousands of intersections validate the surprising effectiveness and generalizability of CityLight, with an overall performance gain of 11.68% and a 22.59% improvement in transfer scenarios in throughput.
♻ ☆ Non-Stationary Bandit Learning via Predictive Sampling
Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
♻ ☆ A Normalized Bottleneck Distance on Persistence Diagrams and Homology Preservation under Dimension Reduction
Persistence diagrams (PDs) are used as signatures of point cloud data. Two clouds of points can be compared using the bottleneck distance d_B between their PDs. A potential drawback of this pipeline is that point clouds sampled from topologically similar manifolds can have arbitrarily large d_B when there is a large scaling between them. This situation is typical in dimension reduction frameworks. We define, and study properties of, a new scale-invariant distance between PDs termed normalized bottleneck distance, d_N. In defining d_N, we develop a broader framework called metric decomposition for comparing finite metric spaces of equal cardinality with a bijection. We utilize metric decomposition to prove a stability result for d_N by deriving an explicit bound on the distortion of the bijective map. We then study two popular dimension reduction techniques, Johnson-Lindenstrauss (JL) projections and metric multidimensional scaling (mMDS), and a third class of general biLipschitz mappings. We provide new bounds on how well these dimension reduction techniques preserve homology with respect to d_N. For a JL map f that transforms input X to f(X), we show that d_N(dgm(X),dgm(f(X))) < e, where dgm(X) is the Vietoris-Rips PD of X, and pairwise distances are preserved by f up to the tolerance 0 < \epsilon < 1. For mMDS, we present new bounds for d_B and d_N between PDs of X and its projection in terms of the eigenvalues of the covariance matrix. And for k-biLipschitz maps, we show that d_N is bounded by the product of (k^2-1)/k and the ratio of diameters of X and f(X). Finally, we use computational experiments to demonstrate the increased effectiveness of using the normalized bottleneck distance for clustering sets of point clouds sampled from different shapes.
comment: Added computational experiments; published in La Matematica
♻ ☆ LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs
The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity. In this work, we introduce an LLMOps pipeline, "LlamaDuo", for the seamless migration of knowledge and abilities from service-oriented LLMs to smaller, locally manageable models. This pipeline is crucial for ensuring service continuity in the presence of operational failures, strict privacy policies, or offline requirements. Our LlamaDuo involves fine-tuning a small language model against the service LLM using a synthetic dataset generated by the latter. If the performance of the fine-tuned model falls short of expectations, it is enhanced by further fine-tuning with additional similar data created by the service LLM. This iterative process guarantees that the smaller model can eventually match or even surpass the service LLM's capabilities in specific downstream tasks, offering a practical and scalable solution for managing AI deployments in constrained environments. Extensive experiments with leading edge LLMs are conducted to demonstrate the effectiveness, adaptability, and affordability of LlamaDuo across various downstream tasks. Our pipeline implementation is available at https://github.com/deep-diver/llamaduo.
comment: 28 pages, 18 figures, 6 tables
♻ ☆ Scalable Variational Causal Discovery Unconstrained by Acyclicity ECAI 2024
Bayesian causal discovery offers the power to quantify epistemic uncertainties among a broad range of structurally diverse causal theories potentially explaining the data, represented in forms of directed acyclic graphs (DAGs). However, existing methods struggle with efficient DAG sampling due to the complex acyclicity constraint. In this study, we propose a scalable Bayesian approach to effectively learn the posterior distribution over causal graphs given observational data thanks to the ability to generate DAGs without explicitly enforcing acyclicity. Specifically, we introduce a novel differentiable DAG sampling method that can generate a valid acyclic causal graph by mapping an unconstrained distribution of implicit topological orders to a distribution over DAGs. Given this efficient DAG sampling scheme, we are able to model the posterior distribution over causal graphs using a simple variational distribution over a continuous domain, which can be learned via the variational inference framework. Extensive empirical experiments on both simulated and real datasets demonstrate the superior performance of the proposed model compared to several state-of-the-art baselines.
comment: Accepted at ECAI 2024
♻ ☆ Enabling Causal Discovery in Post-Nonlinear Models with Normalizing Flows ECAI 2024
Post-nonlinear (PNL) causal models stand out as a versatile and adaptable framework for modeling intricate causal relationships. However, accurately capturing the invertibility constraint required in PNL models remains challenging in existing studies. To address this problem, we introduce CAF-PoNo (Causal discovery via Normalizing Flows for Post-Nonlinear models), harnessing the power of the normalizing flows architecture to enforce the crucial invertibility constraint in PNL models. Through normalizing flows, our method precisely reconstructs the hidden noise, which plays a vital role in cause-effect identification through statistical independence testing. Furthermore, the proposed approach exhibits remarkable extensibility, as it can be seamlessly expanded to facilitate multivariate causal discovery via causal order identification, empowering us to efficiently unravel complex causal relationships. Extensive experimental evaluations on both simulated and real datasets consistently demonstrate that the proposed method outperforms several state-of-the-art approaches in both bivariate and multivariate causal discovery tasks.
comment: Acepted at ECAI 2024
♻ ☆ How to avoid machine learning pitfalls: a guide for academic researchers
Mistakes in machine learning practice are commonplace, and can result in a loss of confidence in the findings and products of machine learning. This guide outlines common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.
♻ ☆ Estimating Direct and Indirect Causal Effects of Spatiotemporal Interventions in Presence of Spatial Interference
Spatial interference (SI) occurs when the treatment at one location affects the outcomes at other locations. Accounting for spatial interference in spatiotemporal settings poses further challenges as interference violates the stable unit treatment value assumption, making it infeasible for standard causal inference methods to quantify the effects of time-varying treatment at spatially varying outcomes. In this paper, we first formalize the concept of spatial interference in case of time-varying treatment assignments by extending the potential outcome framework under the assumption of no unmeasured confounding. We then propose our deep learning based potential outcome model for spatiotemporal causal inference. We utilize latent factor modeling to reduce the bias due to time-varying confounding while leveraging the power of U-Net architecture to capture global and local spatial interference in data over time. Our causal estimators are an extension of average treatment effect (ATE) for estimating direct (DATE) and indirect effects (IATE) of spatial interference on treated and untreated data. Being the first of its kind deep learning based spatiotemporal causal inference technique, our approach shows advantages over several baseline methods based on the experiment results on two synthetic datasets, with and without spatial interference. Our results on real-world climate dataset also align with domain knowledge, further demonstrating the effectiveness of our proposed method.
♻ ☆ Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.
♻ ☆ Joint Optimization of Piecewise Linear Ensembles SP 2024
Tree ensembles achieve state-of-the-art performance on numerous prediction tasks. We propose $\textbf{J}$oint $\textbf{O}$ptimization of $\textbf{P}$iecewise $\textbf{L}$inear $\textbf{En}$sembles (JOPLEn), which jointly fits piecewise linear models at all leaf nodes of an existing tree ensemble. In addition to enhancing the ensemble expressiveness, JOPLEn allows several common penalties, including sparsity-promoting and subspace-norms, to be applied to nonlinear prediction. For example, JOPLEn with a nuclear norm penalty learns subspace-aligned functions. Additionally, JOPLEn (combined with a Dirty LASSO penalty) is an effective feature selection method for nonlinear prediction in multitask learning. Finally, we demonstrate the performance of JOPLEn on 153 regression and classification datasets and with a variety of penalties. JOPLEn leads to improved prediction performance relative to not only standard random forest and boosted tree ensembles, but also other methods for enhancing tree ensembles.
comment: 7 pages, 4 figures, accepted to IEEE MLSP 2024 While preparing the code release, we found minor bugs in the penalty gradient computation and the validation set preprocessing. Fixing these bugs provides the updated results shown in Figure 1 and Section 3.1. The conclusions of the paper remain the same
♻ ☆ DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models
Self-supervised speech models have shown to be useful for various tasks, but their large size limits the use in devices with low computing power and memory. In this work, we explore early exit, an approach for reducing latency by exiting the forward process of a network early. Most approaches of early exit need a separate early exit model for each task, with some even requiring fine-tuning of the entire pretrained model. We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss, eliminating the need for multiple round of training and fine-tuning. DAISY matches the performance of HuBERT on the MiniSUPERB benchmark, but with much faster inference times. Our analysis on the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data while exits late (using more layers) on noisy data, dynamically adjusting the computational cost of inference based on the noise level of each sample.
comment: Accepted by Interspeech 2024
♻ ☆ Attribute Graphs Underlying Molecular Generative Models: Path to Learning with Limited Data
Training generative models that capture rich semantics of the data and interpreting the latent representations encoded by such models are very important problems in un-/self-supervised learning. In this work, we provide a simple algorithm that relies on perturbation experiments on latent codes of a pre-trained generative autoencoder to uncover an attribute graph that is implied by the generative model. We perform perturbation experiments to check for influence of a given latent variable on a subset of attributes. Given this, we show that one can fit an effective graphical model that models a structural equation model between latent codes taken as exogenous variables and attributes taken as observed variables. One interesting aspect is that a single latent variable controls multiple overlapping subsets of attributes unlike conventional approaches that try to impose full independence. Using a pre-trained generative autoencoder trained on a large dataset of small molecules, we demonstrate that the graphical model between various molecular attributes and latent codes learned by our algorithm can be used to predict a specific property for molecules which are drawn from a different distribution. We compare prediction models trained on various feature subsets chosen by simple baselines, as well as existing causal discovery and sparse learning/feature selection methods, with the ones in the derived Markov blanket from our method. Results show empirically that the predictor that relies on our Markov blanket attributes is robust to distribution shifts when transferred or fine-tuned with a few samples from the new distribution, especially when training data is limited.
comment: New experiments; reframed contributions
♻ ☆ MelHuBERT: A simplified HuBERT on Mel spectrograms
Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, significantly hampering the development of self-supervised learning. In an attempt to reduce the computation of training, we revisit the training of HuBERT, a highly successful self-supervised model. We improve and simplify several key components, including the loss function, input representation, and training in multiple stages. Our model, MelHuBERT, is able to achieve favorable performance on phone recognition, speaker identification, and automatic speech recognition against HuBERT, while saving 31.2% of the pre-training time, or equivalently 33.5% MACs per one second speech. The code and pre-trained models are available in https://github.com/nervjack2/MelHuBERT.
comment: ASRU 2023
♻ ☆ Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging as smaller LMs don't perform well universally. This work tries to bridge this gap by proposing a framework to experimentally evaluate small, open LMs in practical settings through measuring semantic correctness of outputs across three practical aspects: task types, application domains and reasoning types, using diverse prompt styles. It also conducts an in-depth comparison of 10 small, open LMs to identify best LM and prompt style depending on specific application requirement using the proposed framework. We also show that if selected appropriately, they can outperform SOTA LLMs like DeepSeek-v2, GPT-4o-mini, Gemini-1.5-Pro, and even compete with GPT-4o.
comment: Submitted to ARR
♻ ☆ BiomedBench: A benchmark suite of TinyML biomedical applications for low-power wearables
The design of low-power wearables for the biomedical domain has received a lot of attention in recent decades, as technological advances in chip manufacturing have allowed real-time monitoring of patients using low-complexity ML within the mW range. Despite advances in application and hardware design research, the domain lacks a systematic approach to hardware evaluation. In this work, we propose BiomedBench, a new benchmark suite composed of complete end-to-end TinyML biomedical applications for real-time monitoring of patients using wearable devices. Each application presents different requirements during typical signal acquisition and processing phases, including varying computational workloads and relations between active and idle times. Furthermore, our evaluation of five state-of-the-art low-power platforms in terms of energy efficiency shows that modern platforms cannot effectively target all types of biomedical applications. BiomedBench is released as an open-source suite to standardize hardware evaluation and guide hardware and application design in the TinyML wearable domain.
comment: 7 pages, 5 figures. Sumbitted to Design & Test Special Issue TinyML
♻ ☆ Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
Creating music is iterative, requiring varied methods at each stage. However, existing AI music systems fall short in orchestrating multiple subsystems for diverse needs. To address this gap, we introduce Loop Copilot, a novel system that enables users to generate and iteratively refine music through an interactive, multi-round dialogue interface. The system uses a large language model to interpret user intentions and select appropriate AI models for task execution. Each backend model is specialized for a specific task, and their outputs are aggregated to meet the user's requirements. To ensure musical coherence, essential attributes are maintained in a centralized table. We evaluate the effectiveness of the proposed system through semi-structured interviews and questionnaires, highlighting its utility not only in facilitating music creation but also its potential for broader applications.
comment: Source code and demo video are available at \url{https://sites.google.com/view/loop-copilot}
Multimedia
☆ MultiMediate'24: Multi-Domain Engagement Estimation
Estimating the momentary level of participant's engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate'24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset which consists of group discussions between three to four people. In this way, MultiMediate'24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate'24 challenge and presents baseline results. In addition, we discuss selected challenge solutions.
comment: arXiv admin note: text overlap with arXiv:2308.08256
☆ Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing
Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-suited for emulating the human capability of AVSR. Despite their potential, research on SNNs for AVSR is scarce, with most existing audio-visual multimodal methods focused on object or digit recognition. These models simply integrate features from both modalities, neglecting their unique characteristics and interactions. Additionally, they often rely on future information for current processing, which increases recognition latency and limits real-time applicability. Inspired by human speech perception, this paper proposes a novel human-inspired SNN named HI-AVSNN for AVSR, incorporating three key characteristics: cueing interaction, causal processing and spike activity. For cueing interaction, we propose a visual-cued auditory attention module (VCA2M) that leverages visual cues to guide attention to auditory features. We achieve causal processing by aligning the SNN's temporal dimension with that of visual and auditory features and applying temporal masking to utilize only past and current information. To implement spike activity, in addition to using SNNs, we leverage the event camera to capture lip movement as spikes, mimicking the human retina and providing efficient visual data. We evaluate HI-AVSNN on an audiovisual speech recognition dataset combining the DVS-Lip dataset with its corresponding audio samples. Experimental results demonstrate the superiority of our proposed fusion method, outperforming existing audio-visual SNN fusion methods and achieving a 2.27% improvement in accuracy over the only existing SNN-based AVSR method.
☆ WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1)extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one-second audio of 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2)improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.
comment: Working in progress. arXiv admin note: text overlap with arXiv:2402.12208
☆ MSLIQA: Enhancing Learning Representations for Image Quality Assessment through Multi-Scale Learning
No-Reference Image Quality Assessment (NR-IQA) remains a challenging task due to the diversity of distortions and the lack of large annotated datasets. Many studies have attempted to tackle these challenges by developing more accurate NR-IQA models, often employing complex and computationally expensive networks, or by bridging the domain gap between various distortions to enhance performance on test datasets. In our work, we improve the performance of a generic lightweight NR-IQA model by introducing a novel augmentation strategy that boosts its performance by almost 28\%. This augmentation strategy enables the network to better discriminate between different distortions in various parts of the image by zooming in and out. Additionally, the inclusion of test-time augmentation further enhances performance, making our lightweight network's results comparable to the current state-of-the-art models, simply through the use of augmentations.
☆ See or Guess: Counterfactually Regularized Image Captioning ACM MM 2024
Image captioning, which generates natural language descriptions of the visual information in an image, is a crucial task in vision-language research. Previous models have typically addressed this task by aligning the generative capabilities of machines with human intelligence through statistical fitting of existing datasets. While effective for normal images, they may struggle to accurately describe those where certain parts of the image are obscured or edited, unlike humans who excel in such cases. These weaknesses they exhibit, including hallucinations and limited interpretability, often hinder performance in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable. Our approach includes two variants leveraging either total effect or natural direct effect. Integrating them into the training process enables models to handle counterfactual scenarios, increasing their generalizability. Extensive experiments on various datasets show that our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models. The code is available at https://github.com/Aman-4-Real/See-or-Guess.
comment: Accepted by ACM MM 2024
Computation and Language
☆ CoGen: Learning from Feedback with Coupled Comprehension and Generation
Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system's language, making it significantly more human-like.
comment: 17 pages, 9 figures
☆ BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems have higher requirements for the collaboration capabilities of language models. Many benchmarks are proposed to evaluate their collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, multi-agent collaborative and competitive scenarios are ignored in existing works. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages of three varying difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conducted extensive evaluations on leading four closed-source and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks but open-source small models struggle with simple tasks. Regarding difficult tasks that require collaborative and competitive abilities, although API-based models have demonstrated some collaborative capabilities, there is still enormous room for improvement.
☆ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping leaves us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: https://github.com/TangYuan96/GreenPLM.
☆ Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
The cultivation of expertise for large language models (LLMs) to solve tasks of specific areas often requires special-purpose tuning with calibrated behaviors on the expected stable outputs. To avoid huge cost brought by manual preparation of instruction datasets and training resources up to hundreds of hours, the exploitation of open knowledge including a wealth of low rank adaptation (LoRA) models and instruction datasets serves as a good starting point. However, existing methods on model and data selection focus on the performance of general-purpose capabilities while neglecting the knowledge gap exposed in domain-specific deployment. In the present study, we propose to bridge such gap by introducing few human-annotated samples (i.e., K-shot) for advancing task expertise of LLMs with open knowledge. Specifically, we develop an efficient and scalable pipeline to cost-efficiently produce task experts where K-shot data intervene in selecting the most promising expert candidates and the task-relevant instructions. A mixture-of-expert (MoE) system is built to make the best use of individual-yet-complementary knowledge between multiple experts. We unveil the two keys to the success of a MoE system, 1) the abidance by K-shot, and 2) the insistence on diversity. For the former, we ensure that models that truly possess problem-solving abilities on K-shot are selected rather than those blind guessers. Besides, during data selection, instructions that share task-relevant contexts with K-shot are prioritized. For the latter, we highlight the diversity of constituting experts and that of the fine-tuning instructions throughout the model and data selection process. Extensive experimental results confirm the superiority of our approach over existing methods on utilization of open knowledge across various tasks. Codes and models will be released later.
comment: 28 pages, 12 tables, 10 figures
☆ LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments
The rapid obsolescence of information in Large Language Models (LLMs) has driven the development of various techniques to incorporate new facts. However, existing methods for knowledge editing still face difficulties with multi-hop questions that require accurate fact identification and sequential logical reasoning, particularly among numerous fact updates. To tackle these challenges, this paper introduces Graph Memory-based Editing for Large Language Models (GMeLLo), a straitforward and effective method that merges the explicit knowledge representation of Knowledge Graphs (KGs) with the linguistic flexibility of LLMs. Beyond merely leveraging LLMs for question answering, GMeLLo employs these models to convert free-form language into structured queries and fact triples, facilitating seamless interaction with KGs for rapid updates and precise multi-hop reasoning. Our results show that GMeLLo significantly surpasses current state-of-the-art knowledge editing methods in the multi-hop question answering benchmark, MQuAKE, especially in scenarios with extensive knowledge edits.
☆ Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on "upcycling" dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and a 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.
☆ A New Method for Cross-Lingual-based Semantic Role Labeling
Semantic role labeling is a crucial task in natural language processing, enabling better comprehension of natural language. However, the lack of annotated data in multiple languages has posed a challenge for researchers. To address this, a deep learning algorithm based on model transfer has been proposed. The algorithm utilizes a dataset consisting of the English portion of CoNLL2009 and a corpus of semantic roles in Persian. To optimize the efficiency of training, only ten percent of the educational data from each language is used. The results of the proposed model demonstrate significant improvements compared to Niksirt et al.'s model. In monolingual mode, the proposed model achieved a 2.05 percent improvement on F1-score, while in cross-lingual mode, the improvement was even more substantial, reaching 6.23 percent. Worth noting is that the compared model only trained two of the four stages of semantic role labeling and employed golden data for the remaining two stages. This suggests that the actual superiority of the proposed model surpasses the reported numbers by a significant margin. The development of cross-lingual methods for semantic role labeling holds promise, particularly in addressing the scarcity of annotated data for various languages. These advancements pave the way for further research in understanding and processing natural language across different linguistic contexts.
☆ Bias in LLMs as Annotators: The Effect of Party Cues on Labelling Decision by Large Language Models
Human coders are biased. We test similar biases in Large Language Models (LLMs) as annotators. By replicating an experiment run by Ennser-Jedenastik and Meyer (2018), we find evidence that LLMs use political information, and specifically party cues, to judge political statements. Not only do LLMs use relevant information to contextualize whether a statement is positive, negative, or neutral based on the party cue, they also reflect the biases of the human-generated data upon which they have been trained. We also find that unlike humans, who are only biased when faced with statements from extreme parties, LLMs exhibit significant bias even when prompted with statements from center-left and center-right parties. The implications of our findings are discussed in the conclusion.
☆ Persuasion Games using Large Language Models
Large Language Models (LLMs) have emerged as formidable instruments capable of comprehending and producing human-like text. This paper explores the potential of LLMs, to shape human perspectives and subsequently influence their decisions on particular tasks. This capability finds applications in diverse domains such as Investment, Credit cards and Insurance, wherein they assist users in selecting appropriate insurance policies, investment plans, Credit cards, Retail, as well as in Behavioral Change Support Systems (BCSS). We present a sophisticated multi-agent framework wherein a consortium of agents operate in collaborative manner. The primary agent engages directly with users through persuasive dialogue, while the auxiliary agents perform tasks such as information retrieval, response analysis, development of persuasion strategies, and validation of facts. Empirical evidence from our experiments demonstrates that this collaborative methodology significantly enhances the persuasive efficacy of the LLM. We analyze user resistance to persuasive efforts continuously and counteract it by employing a combination of rule-based and LLM-based resistance-persuasion mapping techniques. We employ simulated personas and generate conversations in insurance, banking, and retail domains to evaluate the proficiency of large language models (LLMs) in recognizing, adjusting to, and influencing various personality types. Concurrently, we examine the resistance mechanisms employed by LLM simulated personas. Persuasion is quantified via measurable surveys before and after interaction, LLM-generated scores on conversation, and user decisions (purchase or non-purchase).
☆ Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature
The exponential growth of scientific literature necessitates advanced tools for effective knowledge exploration. We present Knowledge Navigator, a system designed to enhance exploratory search abilities by organizing and structuring the retrieved documents from broad topical queries into a navigable, two-level hierarchy of named and descriptive scientific topics and subtopics. This structured organization provides an overall view of the research themes in a domain, while also enabling iterative search and deeper knowledge discovery within specific subtopics by allowing users to refine their focus and retrieve additional relevant documents. Knowledge Navigator combines LLM capabilities with cluster-based methods to enable an effective browsing method. We demonstrate our approach's effectiveness through automatic and manual evaluations on two novel benchmarks, CLUSTREC-COVID and SCITOC. Our code, prompts, and benchmarks are made publicly available.
☆ Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification
As the field of artificial intelligence progresses, assistive technologies are becoming more widely used across all industries. The healthcare industry is no different, with numerous studies being done to develop assistive tools for healthcare professionals. Automatic diagnostic systems are one such beneficial tool that can assist with a variety of tasks, including collecting patient information, analyzing test results, and diagnosing patients. However, the idea of developing systems that can provide a differential diagnosis has been largely overlooked in most of these research studies. In this study, we propose a transformer-based approach for providing differential diagnoses based on a patient's age, sex, medical history, and symptoms. We use the DDXPlus dataset, which provides differential diagnosis information for patients based on 49 disease types. Firstly, we propose a method to process the tabular patient data from the dataset and engineer them into patient reports to make them suitable for our research. In addition, we introduce two data modification modules to diversify the training data and consequently improve the robustness of the models. We approach the task as a multi-label classification problem and conduct extensive experiments using four transformer models. All the models displayed promising results by achieving over 97% F1 score on the held-out test set. Moreover, we design additional behavioral tests to get a broader understanding of the models. In particular, for one of our test cases, we prepared a custom test set of 100 samples with the assistance of a doctor. The results on the custom set showed that our proposed data modification modules improved the model's generalization capabilities. We hope our findings will provide future researchers with valuable insights and inspire them to develop reliable systems for automatic differential diagnosis.
comment: 25 pages, 7 figures
☆ FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench
This paper introduces FRACTURED-SORRY-Bench, a framework for evaluating the safety of Large Language Models (LLMs) against multi-turn conversational attacks. Building upon the SORRY-Bench dataset, we propose a simple yet effective method for generating adversarial prompts by breaking down harmful queries into seemingly innocuous sub-questions. Our approach achieves a maximum increase of +46.22\% in Attack Success Rates (ASRs) across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo models compared to baseline methods. We demonstrate that this technique poses a challenge to current LLM safety measures and highlights the need for more robust defenses against subtle, multi-turn attacks.
comment: 4 pages, 2 tables
☆ Evaluating Computational Representations of Character: An Austen Character Similarity Benchmark
Several systems have been developed to extract information about characters to aid computational analysis of English literature. We propose character similarity grouping as a holistic evaluation task for these pipelines. We present AustenAlike, a benchmark suite of character similarities in Jane Austen's novels. Our benchmark draws on three notions of character similarity: a structurally defined notion of similarity; a socially defined notion of similarity; and an expert defined set extracted from literary criticism. We use AustenAlike to evaluate character features extracted using two pipelines, BookNLP and FanfictionNLP. We build character representations from four kinds of features and compare them to the three AustenAlike benchmarks and to GPT-4 similarity rankings. We find that though computational representations capture some broad similarities based on shared social and narrative roles, the expert pairings in our third benchmark are challenging for all systems, highlighting the subtler aspects of similarity noted by human readers.
☆ Structured Event Reasoning with Large Language Models
Reasoning about real-life events is a unifying challenge in AI and NLP that has profound utility in a variety of domains, while fallacy in high-stake applications could be catastrophic. Able to work with diverse text in these domains, large language models (LLMs) have proven capable of answering questions and solving problems. However, I show that end-to-end LLMs still systematically fail to reason about complex events, and they lack interpretability due to their black-box nature. To address these issues, I propose three general approaches to use LLMs in conjunction with a structured representation of events. The first is a language-based representation involving relations of sub-events that can be learned by LLMs via fine-tuning. The second is a semi-symbolic representation involving states of entities that can be predicted and leveraged by LLMs via few-shot prompting. The third is a fully symbolic representation that can be predicted by LLMs trained with structured data and be executed by symbolic solvers. On a suite of event reasoning tasks spanning common-sense inference and planning, I show that each approach greatly outperforms end-to-end LLMs with more interpretability. These results suggest manners of synergy between LLMs and structured representations for event reasoning and beyond.
comment: PhD thesis
☆ Is Personality Prediction Possible Based on Reddit Comments?
In this assignment, we examine whether there is a correlation between the personality type of a person and the texts they wrote. In order to do this, we aggregated datasets of Reddit comments labeled with the Myers-Briggs Type Indicator (MBTI) of the author and built different supervised classifiers based on BERT to try to predict the personality of an author given a text. Despite experiencing issues with the unfiltered character of the dataset, we can observe potential in the classification.
☆ Logic-Enhanced Language Model Agents for Trustworthy Social Simulations
We introduce the Logic-Enhanced Language Model Agents (LELMA) framework, a novel approach to enhance the trustworthiness of social simulations that utilize large language models (LLMs). While LLMs have gained attention as agents for simulating human behaviour, their applicability in this role is limited by issues such as inherent hallucinations and logical inconsistencies. LELMA addresses these challenges by integrating LLMs with symbolic AI, enabling logical verification of the reasoning generated by LLMs. This verification process provides corrective feedback, refining the reasoning output. The framework consists of three main components: an LLM-Reasoner for producing strategic reasoning, an LLM-Translator for mapping natural language reasoning to logic queries, and a Solver for evaluating these queries. This study focuses on decision-making in game-theoretic scenarios as a model of human interaction. Experiments involving the Hawk-Dove game, Prisoner's Dilemma, and Stag Hunt highlight the limitations of state-of-the-art LLMs, GPT-4 Omni and Gemini 1.0 Pro, in producing correct reasoning in these contexts. LELMA demonstrates high accuracy in error detection and improves the reasoning correctness of LLMs via self-refinement, particularly in GPT-4 Omni.
comment: Source code: https://github.com/dicelab-rhul/LELMA
☆ Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings
This report analyzes the potential for large language models (LLMs) to expedite accurate replication of published message effects studies. We tested LLM-powered participants (personas) by replicating 133 experimental findings from 14 papers containing 45 recent studies in the Journal of Marketing (January 2023-May 2024). We used a new software tool, Viewpoints AI (https://viewpoints.ai/), that takes study designs, stimuli, and measures as input, automatically generates prompts for LLMs to act as a specified sample of unique personas, and collects their responses to produce a final output in the form of a complete dataset and statistical analysis. The underlying LLM used was Anthropic's Claude Sonnet 3.5. We generated 19,447 AI personas to replicate these studies with the exact same sample attributes, study designs, stimuli, and measures reported in the original human research. Our LLM replications successfully reproduced 76% of the original main effects (84 out of 111), demonstrating strong potential for AI-assisted replication of studies in which people respond to media stimuli. When including interaction effects, the overall replication rate was 68% (90 out of 133). The use of LLMs to replicate and accelerate marketing research on media effects is discussed with respect to the replication crisis in social science, potential solutions to generalizability problems in sampling subjects and experimental conditions, and the ability to rapidly test consumer responses to various media stimuli. We also address the limitations of this approach, particularly in replicating complex interaction effects in media response studies, and suggest areas for future research and improvement in AI-assisted experimental replication of media effects.
comment: 24 pages, 3 figures, 2 tables
☆ Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization
In an era where digital text is proliferating at an unprecedented rate, efficient summarization tools are becoming indispensable. While Large Language Models (LLMs) have been successfully applied in various NLP tasks, their role in extractive text summarization remains underexplored. This paper introduces EYEGLAXS (Easy Yet Efficient larGe LAnguage model for eXtractive Summarization), a framework that leverages LLMs, specifically LLAMA2-7B and ChatGLM2-6B, for extractive summarization of lengthy text documents. Instead of abstractive methods, which often suffer from issues like factual inaccuracies and hallucinations, EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity. Utilizing state-of-the-art techniques such as Flash Attention and Parameter-Efficient Fine-Tuning (PEFT), EYEGLAXS addresses the computational and resource challenges typically associated with LLMs. The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv. Furthermore, we extend our research through additional analyses that explore the adaptability of LLMs in handling different sequence lengths and their efficiency in training on smaller datasets. These contributions not only set a new standard in the field but also open up promising avenues for future research in extractive text summarization.
☆ Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough ICML 2024
We investigate continued pretraining of LLMs for language adaptation on a tight academic budget: a setting in which only a few GPUs can be used in parallel, for a heavily constrained duration. We focus on adapting Mistral-7B to German or Arabic and evaluate several techniques to improve efficiency and effectiveness in this setting. Our German models adapted on this tight compute budget underperform compared to the base Mistral-7B, while our Arabic models outperform several baselines, showing that for sufficiently well-represented languages, continued pretraining for specialization is not always helpful. Our main findings focus on training precision and tokenizer swapping. Our results show that pure bfloat16 training is a viable alternative to mixed-precision training, while being much faster when only using a few GPUs. Swapping the tokenizer for a specialized one yields more efficient tokenization and is competitive with the original tokenizer, which already contains some German tokens, but did not significantly increase performance for German. Code and model weights are available at on GitHub.
comment: WANT@ICML 2024
☆ Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions
Virtual counselors powered by large language models (LLMs) aim to create interactive support systems that effectively assist clients struggling with mental health challenges. To replicate counselor-client conversations, researchers have built an online mental health platform that allows professional counselors to provide clients with text-based counseling services for about an hour per session. Notwithstanding its effectiveness, challenges exist as human annotation is time-consuming, cost-intensive, privacy-protected, and not scalable. To address this issue and investigate the applicability of LLMs in psychological counseling conversation simulation, we propose a framework that employs two LLMs via role-playing for simulating counselor-client interactions. Our framework involves two LLMs, one acting as a client equipped with a specific and real-life user profile and the other playing the role of an experienced counselor, generating professional responses using integrative therapy techniques. We implement both the counselor and the client by zero-shot prompting the GPT-4 model. In order to assess the effectiveness of LLMs in simulating counselor-client interactions and understand the disparities between LLM- and human-generated conversations, we evaluate the synthetic data from various perspectives. We begin by assessing the client's performance through automatic evaluations. Next, we analyze and compare the disparities between dialogues generated by the LLM and those generated by professional counselors. Furthermore, we conduct extensive experiments to thoroughly examine the performance of our LLM-based counselor trained with synthetic interactive dialogues by benchmarking against state-of-the-art models for mental health.
☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
☆ A Survey on Evaluation of Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) mimic human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning LLMs as the "brain" and various modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities, and suggests a potential pathway towards achieving artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) "what to evaluate" that reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including general multimodal recognition, perception, reasoning and trustworthiness, and domain-specific applications such as socioeconomic, natural sciences and engineering, medical usage, AI agent, remote sensing, video and audio processing, 3D point cloud analysis, and others; (3) "where to evaluate" that summarizes MLLM evaluation benchmarks into general and specific benchmarks; (4) "how to evaluate" that reviews and illustrates MLLM evaluation steps and metrics; Our overarching goal is to provide valuable insights for researchers in the field of MLLM evaluation, thereby facilitating the development of more capable and reliable MLLMs. We emphasize that evaluation should be regarded as a critical discipline, essential for advancing the field of MLLMs.
☆ Harmonized Speculative Sampling
Speculative sampling has proven to be an effective solution to accelerate decoding from large language models, where the acceptance rate significantly determines the performance. Most previous works on improving the acceptance rate focus on aligned training and efficient decoding, implicitly paying less attention to the linkage of training and decoding. In this work, we first investigate the linkage of training and decoding for speculative sampling and then propose a solution named HArmonized Speculative Sampling (HASS). HASS improves the acceptance rate without extra inference overhead by harmonizing training and decoding on their objectives and contexts. Experiments on three LLaMA models demonstrate that HASS achieves 2.81x-3.65x wall-clock time speedup ratio averaging across three datasets, which is 8%-15% faster than EAGLE-2.
☆ Form and meaning co-determine the realization of tone in Taiwan Mandarin spontaneous speech: the case of Tone 3 sandhi
In Standard Chinese, Tone 3 (the dipping tone) becomes Tone 2 (rising tone) when followed by another Tone 3. Previous studies have noted that this sandhi process may be incomplete, in the sense that the assimilated Tone 3 is still distinct from a true Tone 2. While Mandarin Tone 3 sandhi is widely studied using carefully controlled laboratory speech (Xu, 1997) and more formal registers of Beijing Mandarin (Yuan and Chen, 2014), less is known about its realization in spontaneous speech, and about the effect of contextual factors on tonal realization. The present study investigates the pitch contours of two-character words with T2-T3 and T3-T3 tone patterns in spontaneous Taiwan Mandarin conversations. Our analysis makes use of the Generative Additive Mixed Model (GAMM, Wood, 2017) to examine fundamental frequency (f0) contours as a function of normalized time. We consider various factors known to influence pitch contours, including gender, speaking rate, speaker, neighboring tones, word position, bigram probability, and also novel predictors, word and word sense (Chuang et al., 2024). Our analyses revealed that in spontaneous Taiwan Mandarin, T3-T3 words become indistinguishable from T2-T3 words, indicating complete sandhi, once the strong effect of word (or word sense) is taken into account. For our data, the shape of f0 contours is not co-determined by word frequency. In contrast, the effect of word meaning on f0 contours is robust, as strong as the effect of adjacent tones, and is present for both T2-T3 and T3-T3 words.
☆ LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models
Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase. It provides a cost-effective means of comparing LMs of different sizes and training setups and is useful for monitoring knowledge gained or lost during continual learning (CL). In prior work, we presented an improved knowledge probe called BEAR (Wiland et al., 2024), which enables the comparison of LMs trained with different pre-training objectives (causal and masked LMs) and addresses issues of skewed distributions in previous probes to deliver a more unbiased reading of LM knowledge. With this paper, we present LM-PUB- QUIZ, a Python framework and leaderboard built around the BEAR probing mechanism that enables researchers and practitioners to apply it in their work. It provides options for standalone evaluation and direct integration into the widely-used training pipeline of the Hugging Face TRANSFORMERS library. Further, it provides a fine-grained analysis of different knowledge types to assist users in better understanding the knowledge in each evaluated LM. We publicly release LM-PUB-QUIZ as an open-source project.
☆ An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks
In this paper, we propose a new word embedding based corpus consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline for the filtration of unwanted text from crawled data. Afterwards, the cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. For the evaluation of pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and existing Sindhi fastText word embedding on both intrinsic and extrinsic evaluation approaches
comment: arXiv admin note: substantial text overlap with arXiv:1911.12579
☆ TempoFormer: A Transformer for Temporally-aware Representations in Change Detection
Dynamic representation learning plays a pivotal role in understanding the evolution of linguistic content over time. On this front both context and time dynamics as well as their interplay are of prime importance. Current approaches model context via pre-trained representations, which are typically temporally agnostic. Previous work on modeling context and temporal dynamics has used recurrent methods, which are slow and prone to overfitting. Here we introduce TempoFormer, the fist task-agnostic transformer-based and temporally-aware model for dynamic representation learning. Our approach is jointly trained on inter and intra context dynamics and introduces a novel temporal variation of rotary positional embeddings. The architecture is flexible and can be used as the temporal representation foundation of other models or applied to different transformer-based architectures. We show new SOTA performance on three different real-time change detection tasks.
☆ StyleRemix: Interpretable Authorship Obfuscation via Distillation and Perturbation of Style Elements
Authorship obfuscation, rewriting a text to intentionally obscure the identity of the author, is an important but challenging task. Current methods using large language models (LLMs) lack interpretability and controllability, often ignoring author-specific stylistic features, resulting in less robust performance overall. To address this, we develop StyleRemix, an adaptive and interpretable obfuscation method that perturbs specific, fine-grained style elements of the original input text. StyleRemix uses pre-trained Low Rank Adaptation (LoRA) modules to rewrite an input specifically along various stylistic axes (e.g., formality and length) while maintaining low computational cost. StyleRemix outperforms state-of-the-art baselines and much larger LLMs in a variety of domains as assessed by both automatic and human evaluation. Additionally, we release AuthorMix, a large set of 30K high-quality, long-form texts from a diverse set of 14 authors and 4 domains, and DiSC, a parallel corpus of 1,500 texts spanning seven style axes in 16 unique directions
☆ Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, featured by an auxiliary-loss-free load balancing strategy. To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training. We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.
☆ Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings
Text classification is crucial for applications such as sentiment analysis and toxic text filtering, but it still faces challenges due to the complexity and ambiguity of natural language. Recent advancements in deep learning, particularly transformer architectures and large-scale pretraining, have achieved inspiring success in NLP fields. Building on these advancements, this thesis explores three challenging settings in text classification by leveraging the intrinsic knowledge of pretrained language models (PLMs). Firstly, to address the challenge of selecting misleading yet incorrect distractors for cloze questions, we develop models that utilize features based on contextualized word representations from PLMs, achieving performance that rivals or surpasses human accuracy. Secondly, to enhance model generalization to unseen labels, we create small finetuning datasets with domain-independent task label descriptions, improving model performance and robustness. Lastly, we tackle the sensitivity of large language models to in-context learning prompts by selecting effective demonstrations, focusing on misclassified examples and resolving model ambiguity regarding test example labels.
comment: PhD thesis
☆ CBF-LLM: Safe Control for LLM Alignment
This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the safety filter, designed based on the CBF, to the output generation of the baseline LLM, i.e., the sequence of the token, with the aim of intervening in the generated text. The overall text-generation system is implemented with Llama 3 and a RoBERTa model, and the source code is available at https://github.com/Mya-Mya/CBF-LLM. The experiment demonstrates its control ability and effectiveness in reducing the number of interventions needed for user-specified alignment tasks.
☆ Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications INTERSPEECH 2024
The Word Error Rate (WER) is the common measure of accuracy for Automatic Speech Recognition (ASR). Transcripts are usually pre-processed by substituting specific characters to account for non-semantic differences. As a result of this normalisation, information on the accuracy of punctuation or capitalisation is lost. We present a non-destructive, token-based approach using an extended Levenshtein distance algorithm to compute a robust WER and additional orthographic metrics. Transcription errors are also classified more granularly by existing string similarity and phonetic algorithms. An evaluation on several datasets demonstrates the practical equivalence of our approach compared to common WER computations. We also provide an exemplary analysis of derived use cases, such as a punctuation error rate, and a web application for interactive use and visualisation of our implementation. The code is available open-source.
comment: Accepted in INTERSPEECH 2024
☆ SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models
There is a growing trend of teaching large language models (LLMs) to solve mathematical problems through coding. Existing studies primarily focus on prompting powerful, closed-source models to generate seed training data followed by in-domain data augmentation, equipping LLMs with considerable capabilities for code-aided mathematical reasoning. However, continually training these models on augmented data derived from a few datasets such as GSM8K may impair their generalization abilities and restrict their effectiveness to a narrow range of question types. Conversely, the potential of improving such LLMs by leveraging large-scale, expert-written, diverse math question-answer pairs remains unexplored. To utilize these resources and tackle unique challenges such as code response assessment, we propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. We also explore different alignment algorithms with self-generated instruction/preference data to foster continuous improvement. Experiments across both in-domain (up to +5.7%) and out-of-domain (+4.4%) benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
☆ Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation AAAI 2025
Lossless speculative decoding accelerates target large language model (LLM) inference by employing a lightweight draft model for generating tree-structured candidates, which are subsequently verified in parallel by the target LLM. Currently, effective approaches leverage feature-level rather than token-level autoregression within the draft model to facilitate more straightforward predictions and enhanced knowledge distillation. In this paper, we reassess these approaches and propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding), which introduces two straightforward and effective components within the existing framework to boost lossless speculative decoding. Firstly, FSPAD utilizes token embeddings to sample features of the target LLM in high-dimensional space before feeding them into the draft model, due to the inherent uncertainty of the features preventing the draft model from obtaining the specific token output by the target LLM. Secondly, FSPAD introduces partial alignment distillation to weaken the draft model's connection between features and logits, aiming to reduce the conflict between feature alignment and logit confidence during training. Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series, as well as tasks in multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation. The results show that FSPAD outperforms the state-of-the-art method across all the aforementioned tasks and target LLMs.
comment: The work was not submitted to AAAI 2025
☆ WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
As large language models (LLMs) continue to advance, aligning these models with human preferences has emerged as a critical challenge. Traditional alignment methods, relying on human or LLM annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages real-time, in-situ user interactions to create preference datasets that more accurately reflect authentic human values. WildFeedback operates through a three-step process: feedback signal identification, preference data construction, and user-guided evaluation. We applied this framework to a large corpus of user-LLM conversations, resulting in a rich preference dataset that reflects genuine user preferences. This dataset captures the nuances of user preferences by identifying and classifying feedback signals within natural conversations, thereby enabling the construction of more representative and context-sensitive alignment data. Our extensive experiments demonstrate that LLMs fine-tuned on WildFeedback exhibit significantly improved alignment with user preferences, as evidenced by both traditional benchmarks and our proposed user-guided evaluation. By incorporating real-time feedback from actual users, WildFeedback addresses the scalability, subjectivity, and bias challenges that plague existing approaches, marking a significant step toward developing LLMs that are more responsive to the diverse and evolving needs of their users. In summary, WildFeedback offers a robust, scalable solution for aligning LLMs with true human values, setting a new standard for the development and evaluation of user-centric language models.
comment: 24 pages
☆ SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks.cIn this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
☆ An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication
The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). Through the experiments of five models using MQM-Chat, we observed that all models generated certain fundamental errors, while each of them has different shortcomings, such as omission, overly correcting ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation, emphasizing the importance of stylized content and dialogue consistency for future studies.
☆ Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
This paper presents Dolphin, a novel decoder-decoder architecture for energy-efficient processing of long contexts in language models. Our approach addresses the significant energy consumption and latency challenges inherent in on-device models. Dolphin employs a compact 0.5B parameter decoder to distill extensive contextual information into a memory embedding, substantially reducing the input length for the primary 7B parameter decoder model. Inspired by vision-language models, we repurpose the image embedding projector to encode long textual contexts, effectively treating extended context as a distinct modality. This innovative method enables processing of substantially longer contexts without the typical computational overhead associated with extended input sequences. Empirical evaluations demonstrate a 10-fold improvement in energy efficiency and a 5-fold reduction in latency compared to conventional full-length context processing methods without losing quality of the response. Our work contributes to the development of more sustainable and scalable language models for on-device applications, addressing the critical need for energy-efficient and responsive AI technologies in resource-constrained environments while maintaining the accuracy to understand long contexts. This research has implications for the broader field of natural language processing, particularly in the domain of efficient model design for resource-limited settings. By enabling more sophisticated AI capabilities on edge devices, Dolphin paves the way for advanced language processing in a wide range of applications where computational resources are at a premium. The Dolphin model is publicly available at https://huggingface.co/NexaAIDev/Dolphin.
☆ Towards Fully Autonomous Research Powered by LLMs: Case Study on Simulations
The advent of Large Language Models (LLMs) has created new opportunities for the automation of scientific research, spanning both experimental processes and computational simulations. This study explores the feasibility of constructing an autonomous simulation agent (ASA) powered by LLM, through sophisticated API integration, to automate the entire research process, from experimental design, remote upload and simulation execution, data analysis, to report compilation. Using a simulation problem of polymer chain conformations as a case study, we assessed the performance of ASAs powered by different LLMs including GPT-4-Turbo. Our findings revealed that ASA-GPT-4o achieved near-flawless execution on designated research missions, underscoring the potential of LLMs to manage complete scientific investigations autonomously. The outlined automation can be iteratively performed up to twenty cycles without human intervention, illustrating the potential of LLMs for large-scale autonomous research endeavors. Additionally, we discussed the intrinsic traits of ASAs in managing extensive tasks, focusing on self-validation mechanisms and the balance between local attention and global oversight.
comment: For additional code and data, please visit our GitHub repository: https://github.com/zokaraa/autonomous_simulation_agent
☆ Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions
Causal probing is an approach to interpreting foundation models, such as large language models, by training probes to recognize latent properties of interest from embeddings, intervening on probes to modify this representation, and analyzing the resulting changes in the model's behavior. While some recent works have cast doubt on the theoretical basis of several leading causal probing intervention methods, it has been unclear how to systematically and empirically evaluate their effectiveness in practice. To address this problem, we propose a general empirical analysis framework to evaluate the reliability of causal probing interventions, formally defining and quantifying two key causal probing desiderata: completeness (fully transforming the representation of the target property) and selectivity (minimally impacting other properties). Our formalism allows us to make the first direct comparisons between different families of causal probing methods (e.g., linear vs. nonlinear or counterfactual vs. nullifying interventions). We conduct extensive experiments across several leading methods, finding that (1) there is an inherent tradeoff between these criteria, and no method is able to consistently satisfy both at once; and (2) across the board, nullifying interventions are always far less complete than counterfactual interventions, indicating that nullifying methods may not be an effective approach to causal probing.
☆ Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression
Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate the issue of hallucinations, LLMs often incorporate retrieval-augmented pipeline to provide them with rich external knowledge and context. Nevertheless, challenges stem from inaccurate and coarse-grained context retrieved from the retriever. Supplying irrelevant context to the LLMs can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby accelerating and enhancing the use of LLMs. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency while maintaining performance levels comparable to those achieved with the use of the full context. Specifically, we achieved a 50% reduction in context-related costs, resulting in a 5% reduction in inference memory usage and a 2.2-fold increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These findings suggest that our method strikes an effective balance between efficiency and performance.
comment: 20 pages
☆ Legilimens: Practical and Unified Content Moderation for Large Language Model Services CCS
Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.
comment: Accepted by ACM Conference on Computer and Communications Security (CCS) 2024
♻ ☆ Flextron: Many-in-One Flexible Large Language Model
Training modern LLMs is extremely resource intensive, and customizing them for various deployment scenarios characterized by limited compute and memory resources through repeated training is impractical. In this paper, we introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment. The Flextron architecture utilizes a nested elastic structure to rapidly adapt to specific user-defined latency and accuracy targets during inference with no additional fine-tuning required. It is also input-adaptive, and can automatically route tokens through its sub-networks for improved performance and efficiency. We present a sample-efficient training method and associated routing algorithms for systematically transforming an existing trained LLM into a Flextron model. We evaluate Flextron on the GPT-3 and LLama-2 family of LLMs, and demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% tokens compared to original pretraining.
♻ ☆ Towards Human-Level Text Coding with LLMs: The Case of Fatherhood Roles in Public Policy Documents
Recent advances in large language models (LLMs) like GPT-3.5 and GPT-4 promise automation with better results and less programming, opening up new opportunities for text analysis in political science. In this study, we evaluate LLMs on three original coding tasks involving typical complexities encountered in political science settings: a non-English language, legal and political jargon, and complex labels based on abstract constructs. Along the paper, we propose a practical workflow to optimize the choice of the model and the prompt. We find that the best prompting strategy consists of providing the LLMs with a detailed codebook, as the one provided to human coders. In this setting, an LLM can be as good as or possibly better than a human annotator while being much faster, considerably cheaper, and much easier to scale to large amounts of text. We also provide a comparison of GPT and popular open-source LLMs, discussing the trade-offs in the model's choice. Our software allows LLMs to be easily used as annotators and is publicly available: https://github.com/lorelupo/pappa.
♻ ☆ HC3 Plus: A Semantic-Invariant Human ChatGPT Comparison Corpus CIKM2023
ChatGPT has garnered significant interest due to its impressive performance; however, there is growing concern about its potential risks, particularly in the detection of AI-generated content (AIGC), which is often challenging for untrained individuals to identify. Current datasets used for detecting ChatGPT-generated text primarily focus on question-answering tasks, often overlooking tasks with semantic-invariant properties, such as summarization, translation, and paraphrasing. In this paper, we demonstrate that detecting model-generated text in semantic-invariant tasks is more challenging. To address this gap, we introduce a more extensive and comprehensive dataset that incorporates a wider range of tasks than previous work, including those with semantic-invariant properties.
comment: This paper has been accepted by CIKM2023 workshop
♻ ☆ From Complexity to Clarity: How AI Enhances Perceptions of Scientists and the Public's Understanding of Science
This paper evaluated the effectiveness of using generative AI to simplify science communication and enhance the public's understanding of science. By comparing lay summaries of journal articles from PNAS, yoked to those generated by AI, this work first assessed linguistic simplicity differences across such summaries and public perceptions in follow-up experiments. Specifically, Study 1a analyzed simplicity features of PNAS abstracts (scientific summaries) and significance statements (lay summaries), observing that lay summaries were indeed linguistically simpler, but effect size differences were small. Study 1b used a large language model, GPT-4, to create significance statements based on paper abstracts and this more than doubled the average effect size without fine-tuning. Study 2 experimentally demonstrated that simply-written GPT summaries facilitated more favorable perceptions of scientists (they were perceived as more credible and trustworthy, but less intelligent) than more complexly-written human PNAS summaries. Crucially, Study 3 experimentally demonstrated that participants comprehended scientific writing better after reading simple GPT summaries compared to complex PNAS summaries. In their own words, participants also summarized scientific papers in a more detailed and concrete manner after reading GPT summaries compared to PNAS summaries of the same article. AI has the potential to engage scientific communities and the public via a simple language heuristic, advocating for its integration into scientific dissemination for a more informed society.
comment: 17 pages
♻ ☆ RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
We introduce RecurrentGemma, a family of open language models which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction tuned variants for both. Our models achieve comparable performance to similarly-sized Gemma baselines despite being trained on fewer tokens.
♻ ☆ A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules
Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules. Inspired by the hypothesis testing formulation of watermark detection, our framework starts by selecting a pivotal statistic of the text and a secret key -- provided by the LLM to the verifier -- to enable controlling the false positive rate (the error of mistakenly detecting human-written text as LLM-generated). Next, this framework allows one to evaluate the power of watermark detection rules by obtaining a closed-form expression of the asymptotic false negative rate (the error of incorrectly classifying LLM-generated text as human-written). Our framework further reduces the problem of determining the optimal detection rule to solving a minimax optimization program. We apply this framework to two representative watermarks -- one of which has been internally implemented at OpenAI -- and obtain several findings that can be instrumental in guiding the practice of implementing watermarks. In particular, we derive optimal detection rules for these watermarks under our framework. These theoretically derived detection rules are demonstrated to be competitive and sometimes enjoy a higher power than existing detection approaches through numerical experiments.
♻ ☆ Downstream bias mitigation is all you need
The advent of transformer-based architectures and large language models (LLMs) have significantly advanced the performance of natural language processing (NLP) models. Since these LLMs are trained on huge corpuses of data from the web and other sources, there has been a major concern about harmful prejudices that may potentially be transferred from the data. In many applications, these pre-trained LLMs are fine-tuned on task specific datasets, which can further contribute to biases. This paper studies the extent of biases absorbed by LLMs during pre-training as well as task-specific behaviour after fine-tuning. We found that controlled interventions on pre-trained LLMs, prior to fine-tuning, have minimal effect on lowering biases in classifiers. However, the biases present in domain-specific datasets play a much bigger role, and hence mitigating them at this stage has a bigger impact. While pre-training does matter, but after the model has been pre-trained, even slight changes to co-occurrence rates in the fine-tuning dataset has a significant effect on the bias of the model.
comment: arXiv admin note: This work has been withdrawn by arXiv administrators due to inappropriate text reuse from external sources
♻ ☆ Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models
Tool-augmented large language models (LLMs) are attracting widespread attention when accessing up-to-date knowledge and alleviating hallucination issues. Nowadays, advanced closed-source LLMs (e.g., ChatGPT) have demonstrated surprising tool-usage capabilities through prompting and in-context learning techniques. To empower the capabilities of open-source LLMs (e.g., LLaMA) in manipulating tools, current efforts focus on either template-driven or token-triggered tool-usage. However, the former hampers LLMs' flexibility to address diverse user's queries due to constrained tool interactions, while the latter limits the generalizability when engaging with new tools, since tool-usage learning is based on task- and tool-specific datasets. To alleviate these concerns, in this paper, we propose a decision-aware and generalizable tool-usage framework (DEER). Specifically, we first construct the tool-usage samples with multiple decision branches via an automatic generation pipeline, thereby inspiring the decision-making awareness of LLMs under diverse scenarios. Meanwhile, we propose a novel tool sampling strategy to enhance the generalizability of LLMs over unseen tools. Extensive experiments demonstrate that our proposed DEER is effective and significantly outperforms baselines across various datasets.
comment: 20 pages, 18 figures
♻ ☆ eRST: A Signaled Graph Theory of Discourse Relations and Organization
In this article we present Enhanced Rhetorical Structure Theory (eRST), a new theoretical framework for computational discourse analysis, based on an expansion of Rhetorical Structure Theory (RST). The framework encompasses discourse relation graphs with tree-breaking, non-projective and concurrent relations, as well as implicit and explicit signals which give explainable rationales to our analyses. We survey shortcomings of RST and other existing frameworks, such as Segmented Discourse Representation Theory (SDRT), the Penn Discourse Treebank (PDTB) and Discourse Dependencies, and address these using constructs in the proposed theory. We provide annotation, search and visualization tools for data, and present and evaluate a freely available corpus of English annotated according to our framework, encompassing 12 spoken and written genres with over 200K tokens. Finally, we discuss automatic parsing, evaluation metrics and applications for data in our framework.
♻ ☆ SurGen: Text-Guided Diffusion Model for Surgical Video Generation
Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis, producing the highest resolution and longest duration videos among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
♻ ☆ Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods
Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems using pretrained large language models (LLMs). In this work, we analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity. To this end, we introduce a multi-step latent variable model that encapsulates the reasoning process, where the latent variable encodes the task information. Under this framework, we demonstrate that when the pretraining dataset is sufficiently large, the estimator formed by CoT prompting is equivalent to a Bayesian estimator. This estimator effectively solves the multi-step reasoning problem by aggregating a posterior distribution inferred from the demonstration examples in the prompt. Moreover, we prove that the statistical error of the CoT estimator can be decomposed into two main components: (i) a prompting error, which arises from inferring the true task using CoT prompts, and (ii) the statistical error of the pretrained LLM. We establish that, under appropriate assumptions, the prompting error decays exponentially to zero as the number of demonstrations increases. Additionally, we explicitly characterize the approximation and generalization errors of the pretrained LLM. Notably, we construct a transformer model that approximates the target distribution of the multi-step reasoning problem with an error that decreases exponentially in the number of transformer blocks. Our analysis extends to other variants of CoT, including Self-Consistent CoT, Tree-of-Thought, and Selection-Inference, offering a broad perspective on the efficacy of these methods. We also provide numerical experiments to validate the theoretical findings.
comment: 150 pages, 18 figures, 3 tables
♻ ☆ Stick to your Role! Stability of Personal Values Expressed in Large Language Models
The standard way to study Large Language Models (LLMs) with benchmarks or psychology questionnaires is to provide many different queries from similar minimal contexts (e.g. multiple choice questions). However, due to LLMs' highly context-dependent nature, conclusions from such minimal-context evaluations may be little informative about the model's behavior in deployment (where it will be exposed to many new contexts). We argue that context-dependence (specifically, value stability) should be studied as a specific property of LLMs and used as another dimension of LLM comparison (alongside others such as cognitive abilities, knowledge, or model size). We present a case-study on the stability of value expression over different contexts (simulated conversations on different topics) as measured using a standard psychology questionnaire (PVQ) and on behavioral downstream tasks. Reusing methods from psychology, we study Rank-order stability on the population (interpersonal) level, and Ipsative stability on the individual (intrapersonal) level. We consider two settings (with and without instructing LLMs to simulate particular personas), two simulated populations, and three downstream tasks. We observe consistent trends in the stability of models and model families - Mixtral, Mistral, GPT-3.5 and Qwen families are more stable than LLaMa-2 and Phi. The consistency of these trends implies that some models exhibit higher value stability than others, and that stability can be estimated with the set of introduced methodological tools. When instructed to simulate particular personas, LLMs exhibit low Rank-order stability, which further diminishes with conversation length. This highlights the need for future research on LLMs that coherently simulate different personas. This paper provides a foundational step in that direction, and, to our knowledge, it is the first study of value stability in LLMs.
comment: The project website and code are available at https://sites.google.com/view/llmvaluestability Published in PLOS ONE ( https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0309114 ), and a shorter version at CogSci 24 ( https://escholarship.org/uc/item/7w4823c6 )
♻ ☆ Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
The advent of large language models such as ChatGPT, Gemini, and others has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been comprehensively assessed. This study addresses this gap by introducing a novel multi-task spatial evaluation dataset, designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset encompasses twelve distinct task types, including spatial understanding and path planning, each with verified, accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4o, and ZhipuAI's glm-4, through a two-phase testing approach. Initially, we conducted zero-shot testing, followed by categorizing the dataset by difficulty and performing prompt tuning tests. Results indicate that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it surpassed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For example, the Chain-of-Thought (COT) strategy increased gpt-4o's accuracy in path planning from 12.4% to 87.5%, while a one-shot strategy enhanced moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
♻ ☆ Language-specific Calibration for Pruning Multilingual Language Models
Recent advances in large language model (LLM) pruning have shown state-of-the-art compression results in post-training and retraining-free settings while maintaining high predictive performance. However, such research mainly considers calibrating pruning using English text, despite the multilingual nature of modern LLMs and their frequent uses in non-English languages. In this paper, we set out to explore effective strategies for calibrating the pruning of multilingual language models. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse tasks, models, and state-of-the-art pruning techniques. Our results present practical suggestions, for example, calibrating in the target language can efficiently yield lower perplexity, but does not necessarily benefit downstream tasks. Our further analysis experiments unveil that calibration in the target language mainly contributes to preserving language-specific features related to fluency and coherence, but might not contribute to capturing language-agnostic features such as language understanding and reasoning. Last, we provide practical recommendations for future practitioners.
♻ ☆ Evading AI-Generated Content Detectors using Homoglyphs
The advent of large language models (LLMs) has enabled the generation of text that increasingly exhibits human-like characteristics. As the detection of such content is of significant importance, numerous studies have been conducted with the aim of developing reliable AI-generated text detectors. These detectors have demonstrated promising results on test data, but recent research has revealed that they can be circumvented by employing different techniques. In this paper, we present homoglyph-based attacks ($a \rightarrow {\alpha}$) as a means of circumventing existing detectors. A comprehensive evaluation was conducted to assess the effectiveness of these attacks on seven detectors, including ArguGPT, Binoculars, DetectGPT, Fast-DetectGPT, Ghostbuster, OpenAI's detector, and watermarking techniques, on five different datasets. Our findings demonstrate that homoglyph-based attacks can effectively circumvent state-of-the-art detectors, leading them to classify all texts as either AI-generated or human-written (decreasing the average Matthews Correlation Coefficient from 0.64 to -0.01). We then examine the effectiveness of these attacks by analyzing how homoglyphs impact different families of detectors. Finally, we discuss the implications of these findings and potential defenses against such attacks.
♻ ☆ Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning ACL 2024
Through pretraining on a corpus with various sources, Large Language Models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, the organization of the pretraining corpus is still empirical and may deviate from the optimal. To address this issue, we systematically analyze the impact of 48 datasets from 5 major categories of pretraining data of LLMs and measure their impacts on LLMs using benchmarks about nine major categories of model capabilities. Our analyses provide empirical results about the contribution of multiple corpora on the performances of LLMs, along with their joint impact patterns, including complementary, orthogonal, and correlational relationships. We also identify a set of ``high-impact data'' such as Books that is significantly related to a set of model capabilities. These findings provide insights into the organization of data to support more efficient pretraining of LLMs.
comment: Accepted by ACL 2024 Findings
♻ ☆ PASH at TREC 2021 Deep Learning Track: Generative Enhanced Model for Multi-stage Ranking
This paper describes the PASH participation in TREC 2021 Deep Learning Track. In the recall stage, we adopt a scheme combining sparse and dense retrieval method. In the multi-stage ranking phase, point-wise and pair-wise ranking strategies are used one after another based on model continual pre-trained on general knowledge and document-level data. Compared to TREC 2020 Deep Learning Track, we have additionally introduced the generative model T5 to further enhance the performance.
comment: TREC 2021
♻ ☆ Large Language Model Sentinel: LLM Agent for Adversarial Purification
Over the past two years, the use of large language models (LLMs) has advanced rapidly. While these LLMs offer considerable convenience, they also raise security concerns, as LLMs are vulnerable to adversarial attacks by some well-designed textual perturbations. In this paper, we introduce a novel defense technique named Large LAnguage MOdel Sentinel (LLAMOS), which is designed to enhance the adversarial robustness of LLMs by purifying the adversarial textual examples before feeding them into the target LLM. Our method comprises two main components: a) Agent instruction, which can simulate a new agent for adversarial defense, altering minimal characters to maintain the original meaning of the sentence while defending against attacks; b) Defense guidance, which provides strategies for modifying clean or adversarial examples to ensure effective defense and accurate outputs from the target LLMs. Remarkably, the defense agent demonstrates robust defensive capabilities even without learning from adversarial examples. Additionally, we conduct an intriguing adversarial experiment where we develop two agents, one for defense and one for attack, and engage them in mutual confrontation. During the adversarial interactions, neither agent completely beat the other. Extensive experiments on both open-source and closed-source LLMs demonstrate that our method effectively defends against adversarial attacks, thereby enhancing adversarial robustness.
♻ ☆ AI-native Memory: A Pathway from LLMs Towards AGI
Large language models (LLMs) have demonstrated the world with the sparks of artificial general intelligence (AGI). One opinion, especially from some startups working on LLMs, argues that an LLM with nearly unlimited context length can realize AGI. However, they might be too optimistic about the long-context capability of (existing) LLMs -- (1) Recent literature has shown that their effective context length is significantly smaller than their claimed context length; and (2) Our reasoning-in-a-haystack experiments further demonstrate that simultaneously finding the relevant information from a long context and conducting (simple) reasoning is nearly impossible. In this paper, we envision a pathway from LLMs to AGI through the integration of \emph{memory}. We believe that AGI should be a system where LLMs serve as core processors. In addition to raw data, the memory in this system would store a large number of important conclusions derived from reasoning processes. Compared with retrieval-augmented generation (RAG) that merely processing raw data, this approach not only connects semantically related information closer, but also simplifies complex inferences at the time of querying. As an intermediate stage, the memory will likely be in the form of natural language descriptions, which can be directly consumed by users too. Ultimately, every agent/person should have its own large personal model, a deep neural network model (thus \emph{AI-native}) that parameterizes and compresses all types of memory, even the ones cannot be described by natural languages. Finally, we discuss the significant potential of AI-native memory as the transformative infrastructure for (proactive) engagement, personalization, distribution, and social in the AGI era, as well as the incurred privacy and security challenges with preliminary solutions.
♻ ☆ SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama
Generating high-quality shooting scripts containing information such as scene and shot language is essential for short drama script generation. We collect 6,660 popular short drama episodes from the Internet, each with an average of 100 short episodes, and the total number of short episodes is about 80,000, with a total duration of about 2,000 hours and totaling 10 terabytes (TB). We perform keyframe extraction and annotation on each episode to obtain about 10,000,000 shooting scripts. We perform 100 script restorations on the extracted shooting scripts based on our self-developed large short drama generation model SkyReels. This leads to a dataset containing 1,000,000,000 pairs of scripts and shooting scripts for short dramas, called SkyScript-100M. We compare SkyScript-100M with the existing dataset in detail and demonstrate some deeper insights that can be achieved based on SkyScript-100M. Based on SkyScript-100M, researchers can achieve several deeper and more far-reaching script optimization goals, which may drive a paradigm shift in the entire field of text-to-video and significantly advance the field of short drama video generation. The data and code are available at https://github.com/vaew/SkyScript-100M.
comment: 18 pages, 12 figures
♻ ☆ SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At the high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (\textit{e.g.}, VALL-E) or Non-auto-regressive (NAR) based models (\textit{e.g.}, NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present ({\romannumeral1}) a detailed analysis of the influence of speech tokenizer and noisy label for TTS performance; ({\romannumeral2}) four distinct types of sentence duration predictors; ({\romannumeral3}) a novel flow-based scalar latent transformer diffusion model. With these improvement, we show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available on: {https://dongchaoyang.top/SimpleSpeech2\_demo/}.
comment: Submit to TASLP
♻ ☆ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
♻ ☆ A Survey of Large Language Models for European Languages
Large Language Models (LLMs) have gained significant attention due to their high performance on a wide range of natural language tasks since the release of ChatGPT. The LLMs learn to understand and generate language by training billions of model parameters on vast volumes of text data. Despite being a relatively new field, LLM research is rapidly advancing in various directions. In this paper, we present an overview of LLM families, including LLaMA, PaLM, GPT, and MoE, and the methods developed to create and enhance LLMs for official European Union (EU) languages. We provide a comprehensive summary of common monolingual and multilingual datasets used for pretraining large language models.
♻ ☆ WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs KDD
Large Language Models (LLMs) have greatly contributed to the development of adaptive intelligent agents and are positioned as an important way to achieve Artificial General Intelligence (AGI). However, LLMs are prone to produce factually incorrect information and often produce "phantom" content that undermines their reliability, which poses a serious challenge for their deployment in real-world scenarios. Enhancing LLMs by combining external databases and information retrieval mechanisms is an effective path. To address the above challenges, we propose a new approach called WeKnow-RAG, which integrates Web search and Knowledge Graphs into a "Retrieval-Augmented Generation (RAG)" system. First, the accuracy and reliability of LLM responses are improved by combining the structured representation of Knowledge Graphs with the flexibility of dense vector retrieval. WeKnow-RAG then utilizes domain-specific knowledge graphs to satisfy a variety of queries and domains, thereby improving performance on factual information and complex reasoning tasks by employing multi-stage web page retrieval techniques using both sparse and dense retrieval methods. Our approach effectively balances the efficiency and accuracy of information retrieval, thus improving the overall retrieval process. Finally, we also integrate a self-assessment mechanism for the LLM to evaluate the trustworthiness of the answers it generates. Our approach proves its outstanding effectiveness in a wide range of offline experiments and online submissions.
comment: 8 pages, 2 figures, technical report for 3rd place in Task 3 of Meta KDD Cup 2024 CRAG Challenge
♻ ☆ Large Language Models Understand Layout ECAI-2024
Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perceiving and reasoning, while a drastic performance drop is observed when the spatial markers from the original data are excluded. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the coding data for pretraining, which is further enhanced at the instruction-tuning stage. In addition, layout understanding can be enhanced by integrating low-cost, auto-generated data approached by a novel text game. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.
comment: This paper has been accepted by ECAI-2024
♻ ☆ VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities CIKM2024
Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data (e.g., images and videos) into symbols, have attracted attention as resources enabling knowledge processing and machine learning across modalities. However, the construction of MMKGs for videos consisting of multiple events, such as daily activities, is still in the early stages. In this paper, we construct an MMKG based on synchronized multi-view simulated videos of daily activities. Besides representing the content of daily life videos as event-centric knowledge, our MMKG also includes frame-by-frame fine-grained changes, such as bounding boxes within video frames. In addition, we provide support tools for querying our MMKG. As an application example, we demonstrate that our MMKG facilitates benchmarking vision-language models by providing the necessary vision-language datasets for a tailored task.
comment: 5 pages, 4 figures, accepted by CIKM2024 Resource Track
Computer Vision and Pattern Recognition
☆ Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: https://github.com/NVlabs/Eagle
comment: Github: https://github.com/NVlabs/Eagle, HuggingFace: https://huggingface.co/NVEagle
☆ Spatio-Temporal Context Prompting for Zero-Shot Action Detection
Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person's interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found in https://webber2933.github.io/ST-CLIP-project-page.
☆ TEDRA: Text-based Editing of Dynamic and Photoreal Actors
Over the past years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans. However, a core remaining challenge is the fine-grained and user-friendly editing of clothing styles by means of textual descriptions. To this end, we present TEDRA, the first method allowing text-based edits of an avatar, which maintains the avatar's high fidelity, space-time coherency, as well as dynamics, and enables skeletal pose and view control. We begin by training a model to create a controllable and high-fidelity digital replica of the real actor. Next, we personalize a pretrained generative diffusion model by fine-tuning it on various frames of the real character captured from different camera angles, ensuring the digital representation faithfully captures the dynamics and movements of the real person. This two-stage process lays the foundation for our approach to dynamic human avatar editing. Utilizing this personalized diffusion model, we modify the dynamic avatar based on a provided text prompt using our Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework. Additionally, we propose a time step annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in functionality and visual quality.
comment: For project page, see this https://vcai.mpi-inf.mpg.de/projects/Tedra
☆ Perceive-IR: Learning to Perceive Degradation Better for All-in-One Image Restoration
The limitations of task-specific and general image restoration methods for specific degradation have prompted the development of all-in-one image restoration techniques. However, the diversity of patterns among multiple degradation, along with the significant uncertainties in mapping between degraded images of different severities and their corresponding undistorted versions, pose significant challenges to the all-in-one restoration tasks. To address these challenges, we propose Perceive-IR, an all-in-one image restorer designed to achieve fine-grained quality control that enables restored images to more closely resemble their undistorted counterparts, regardless of the type or severity of degradation. Specifically, Perceive-IR contains two stages: (1) prompt learning stage and (2) restoration stage. In the prompt learning stage, we leverage prompt learning to acquire a fine-grained quality perceiver capable of distinguishing three-tier quality levels by constraining the prompt-image similarity in the CLIP perception space. Subsequently, this quality perceiver and difficulty-adaptive perceptual loss are integrated as a quality-aware learning strategy to realize fine-grained quality control in restoration stage. For the restoration stage, a semantic guidance module (SGM) and compact feature extraction (CFE) are proposed to further promote the restoration process by utilizing the robust semantic information from the pre-trained large scale vision models and distinguishing degradation-specific features. Extensive experiments demonstrate that our Perceive-IR outperforms state-of-the-art methods in all-in-one image restoration tasks and exhibit superior generalization ability when dealing with unseen tasks.
comment: 13 pages, 8 figures
☆ ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
Detecting and attributing temperature increases due to climate change is crucial for understanding global warming and guiding adaptation strategies. The complexity of distinguishing human-induced climate signals from natural variability has challenged traditional detection and attribution (D&A) approaches, which seek to identify specific "fingerprints" in climate response variables. Deep learning offers potential for discerning these complex patterns in expansive spatial datasets. However, lack of standard protocols has hindered consistent comparisons across studies. We introduce ClimDetect, a standardized dataset of over 816k daily climate snapshots, designed to enhance model accuracy in identifying climate change signals. ClimDetect integrates various input and target variables used in past research, ensuring comparability and consistency. We also explore the application of vision transformers (ViT) to climate data, a novel and modernizing approach in this context. Our open-access data and code serve as a benchmark for advancing climate science through improved model evaluations. ClimDetect is publicly accessible via Huggingface dataet respository at: https://huggingface.co/datasets/ClimDetect/ClimDetect.
☆ CoGen: Learning from Feedback with Coupled Comprehension and Generation
Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system's language, making it significantly more human-like.
comment: 17 pages, 9 figures
☆ Distribution Backtracking Builds A Faster Convergence Trajectory for One-step Diffusion Distillation
Accelerating the sampling speed of diffusion models remains a significant challenge. Recent score distillation methods distill a heavy teacher model into an one-step student generator, which is optimized by calculating the difference between the two score functions on the samples generated by the student model. However, there is a score mismatch issue in the early stage of the distillation process, because existing methods mainly focus on using the endpoint of pre-trained diffusion models as teacher models, overlooking the importance of the convergence trajectory between the student generator and the teacher model. To address this issue, we extend the score distillation process by introducing the entire convergence trajectory of teacher models and propose Distribution Backtracking Distillation (DisBack) for distilling student generators. DisBask is composed of two stages: Degradation Recording and Distribution Backtracking. Degradation Recording is designed to obtain the convergence trajectory of teacher models, which records the degradation path from the trained teacher model to the untrained initial student generator. The degradation path implicitly represents the intermediate distributions of teacher models. Then Distribution Backtracking trains a student generator to backtrack the intermediate distributions for approximating the convergence trajectory of teacher models. Extensive experiments show that DisBack achieves faster and better convergence than the existing distillation method and accomplishes comparable generation performance. Notably, DisBack is easy to implement and can be generalized to existing distillation methods to boost performance. Our code is publicly available on https://github.com/SYZhang0805/DisBack.
☆ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping leaves us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: https://github.com/TangYuan96/GreenPLM.
☆ Efficient Slice Anomaly Detection Network for 3D Brain MRI Volume
Current anomaly detection methods excel with benchmark industrial data but struggle with natural images and medical data due to varying definitions of 'normal' and 'abnormal.' This makes accurate identification of deviations in these fields particularly challenging. Especially for 3D brain MRI data, all the state-of-the-art models are reconstruction-based with 3D convolutional neural networks which are memory-intensive, time-consuming and producing noisy outputs that require further post-processing. We propose a framework called Simple Slice-based Network (SimpleSliceNet), which utilizes a model pre-trained on ImageNet and fine-tuned on a separate MRI dataset as a 2D slice feature extractor to reduce computational cost. We aggregate the extracted features to perform anomaly detection tasks on 3D brain MRI volumes. Our model integrates a conditional normalizing flow to calculate log likelihood of features and employs the Semi-Push-Pull Mechanism to enhance anomaly detection accuracy. The results indicate improved performance, showcasing our model's remarkable adaptability and effectiveness when addressing the challenges exists in brain MRI data. In addition, for the large-scale 3D brain volumes, our model SimpleSliceNet outperforms the state-of-the-art 2D and 3D models in terms of accuracy, memory usage and time consumption. Code is available at: https://anonymous.4open.science/r/SimpleSliceNet-8EA3.
comment: 15 pages, 5 figures
☆ Generating Binary Species Range Maps
Accurately predicting the geographic ranges of species is crucial for assisting conservation efforts. Traditionally, range maps were manually created by experts. However, species distribution models (SDMs) and, more recently, deep learning-based variants offer a potential automated alternative. Deep learning-based SDMs generate a continuous probability representing the predicted presence of a species at a given location, which must be binarized by setting per-species thresholds to obtain binary range maps. However, selecting appropriate per-species thresholds to binarize these predictions is non-trivial as different species can require distinct thresholds. In this work, we evaluate different approaches for automatically identifying the best thresholds for binarizing range maps using presence-only data. This includes approaches that require the generation of additional pseudo-absence data, along with ones that only require presence data. We also propose an extension of an existing presence-only technique that is more robust to outliers. We perform a detailed evaluation of different thresholding techniques on the tasks of binary range estimation and large-scale fine-grained visual classification, and we demonstrate improved performance over existing pseudo-absence free approaches using our method.
☆ Fall Detection for Smart Living using YOLOv5
This work introduces a fall detection system using the YOLOv5mu model, which achieved a mean average precision (mAP) of 0.995, demonstrating exceptional accuracy in identifying fall events within smart home environments. Enhanced by advanced data augmentation techniques, the model demonstrates significant robustness and adaptability across various conditions. The integration of YOLOv5mu offers precise, real-time fall detection, which is crucial for improving safety and emergency response for residents. Future research will focus on refining the system by incorporating contextual data and exploring multi-sensor approaches to enhance its performance and practical applicability in diverse environments.
☆ InstanSeg: an embedding-based instance segmentation algorithm optimized for accurate, efficient and portable cell segmentation
Cell and nucleus segmentation are fundamental tasks for quantitative bioimage analysis. Despite progress in recent years, biologists and other domain experts still require novel algorithms to handle increasingly large and complex real-world datasets. These algorithms must not only achieve state-of-the-art accuracy, but also be optimized for efficiency, portability and user-friendliness. Here, we introduce InstanSeg: a novel embedding-based instance segmentation pipeline designed to identify cells and nuclei in microscopy images. Using six public cell segmentation datasets, we demonstrate that InstanSeg can significantly improve accuracy when compared to the most widely used alternative methods, while reducing the processing time by at least 60%. Furthermore, InstanSeg is designed to be fully serializable as TorchScript and supports GPU acceleration on a range of hardware. We provide an open-source implementation of InstanSeg in Python, in addition to a user-friendly, interactive QuPath extension for inference written in Java. Our code and pre-trained models are available at https://github.com/instanseg/instanseg .
comment: 12 pages,6 figures
☆ Auxiliary Input in Training: Incorporating Catheter Features into Deep Learning Models for ECG-Free Dynamic Coronary Roadmapping MICCAI 2024
Dynamic coronary roadmapping is a technology that overlays the vessel maps (the "roadmap") extracted from an offline image sequence of X-ray angiography onto a live stream of X-ray fluoroscopy in real-time. It aims to offer navigational guidance for interventional surgeries without the need for repeated contrast agent injections, thereby reducing the risks associated with radiation exposure and kidney failure. The precision of the roadmaps is contingent upon the accurate alignment of angiographic and fluoroscopic images based on their cardiac phases, as well as precise catheter tip tracking. The former ensures the selection of a roadmap that closely matches the vessel shape in the current frame, while the latter uses catheter tips as reference points to adjust for translational motion between the roadmap and the present vessel tree. Training deep learning models for both tasks is challenging and underexplored. However, incorporating catheter features into the models could offer substantial benefits, given humans heavily rely on catheters to complete the tasks. To this end, we introduce a simple but effective method, auxiliary input in training (AIT), and demonstrate that it enhances model performance across both tasks, outperforming baseline methods in knowledge incorporation and transfer learning.
comment: MICCAI 2024
☆ Sigma Flows for Image and Data Labeling and Learning Structured Prediction
This paper introduces the sigma flow model for the prediction of structured labelings of data observed on Riemannian manifolds, including Euclidean image domains as special case. The approach combines the Laplace-Beltrami framework for image denoising and enhancement, introduced by Sochen, Kimmel and Malladi about 25 years ago, and the assignment flow approach introduced and studied by the authors. The sigma flow arises as Riemannian gradient flow of generalized harmonic energies and thus is governed by a nonlinear geometric PDE which determines a harmonic map from a closed Riemannian domain manifold to a statistical manifold, equipped with the Fisher-Rao metric from information geometry. A specific ingredient of the sigma flow is the mutual dependency of the Riemannian metric of the domain manifold on the evolving state. This makes the approach amenable to machine learning in a specific way, by realizing this dependency through a mapping with compact time-variant parametrization that can be learned from data. Proof of concept experiments demonstrate the expressivity of the sigma flow model and prediction performance. Structural similarities to transformer network architectures and networks generated by the geometric integration of sigma flows are pointed out, which highlights the connection to deep learning and, conversely, may stimulate the use of geometric design principles for structured prediction in other areas of scientific machine learning.
comment: 51 pages
☆ Local Descriptors Weighted Adaptive Threshold Filtering For Few-Shot Learning
Few-shot image classification is a challenging task in the field of machine learning, involving the identification of new categories using a limited number of labeled samples. In recent years, methods based on local descriptors have made significant progress in this area. However, the key to improving classification accuracy lies in effectively filtering background noise and accurately selecting critical local descriptors highly relevant to image category information. To address this challenge, we propose an innovative weighted adaptive threshold filtering (WATF) strategy for local descriptors. This strategy can dynamically adjust based on the current task and image context, thereby selecting local descriptors most relevant to the image category. This enables the model to better focus on category-related information while effectively mitigating interference from irrelevant background regions. To evaluate the effectiveness of our method, we adopted the N-way K-shot experimental framework. Experimental results show that our method not only improves the clustering effect of selected local descriptors but also significantly enhances the discriminative ability between image categories. Notably, our method maintains a simple and lightweight design philosophy without introducing additional learnable parameters. This feature ensures consistency in filtering capability during both training and testing phases, further enhancing the reliability and practicality of the method.
☆ DiffAge3D: Diffusion-based 3D-aware Face Aging
Face aging is the process of converting an individual's appearance to a younger or older version of themselves. Existing face aging techniques have been limited to 2D settings, which often weaken their applications as there is a growing demand for 3D face modeling. Moreover, existing aging methods struggle to perform faithful aging, maintain identity, and retain the fine details of the input images. Given these limitations and the need for a 3D-aware aging method, we propose DiffAge3D, the first 3D-aware aging framework that not only performs faithful aging and identity preservation but also operates in a 3D setting. Our aging framework allows to model the aging and camera pose separately by only taking a single image with a target age. Our framework includes a robust 3D-aware aging dataset generation pipeline by utilizing a pre-trained 3D GAN and the rich text embedding capabilities within CLIP model. Notably, we do not employ any inversion bottleneck in dataset generation. Instead, we randomly generate training samples from the latent space of 3D GAN, allowing us to manipulate the rich latent space of GAN to generate ages even with large gaps. With the generated dataset, we train a viewpoint-aware diffusion-based aging model to control the camera pose and facial age. Through quantitative and qualitative evaluations, we demonstrate that DiffAge3D outperforms existing methods, particularly in multiview-consistent aging and fine details preservation.
☆ Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
The cultivation of expertise for large language models (LLMs) to solve tasks of specific areas often requires special-purpose tuning with calibrated behaviors on the expected stable outputs. To avoid huge cost brought by manual preparation of instruction datasets and training resources up to hundreds of hours, the exploitation of open knowledge including a wealth of low rank adaptation (LoRA) models and instruction datasets serves as a good starting point. However, existing methods on model and data selection focus on the performance of general-purpose capabilities while neglecting the knowledge gap exposed in domain-specific deployment. In the present study, we propose to bridge such gap by introducing few human-annotated samples (i.e., K-shot) for advancing task expertise of LLMs with open knowledge. Specifically, we develop an efficient and scalable pipeline to cost-efficiently produce task experts where K-shot data intervene in selecting the most promising expert candidates and the task-relevant instructions. A mixture-of-expert (MoE) system is built to make the best use of individual-yet-complementary knowledge between multiple experts. We unveil the two keys to the success of a MoE system, 1) the abidance by K-shot, and 2) the insistence on diversity. For the former, we ensure that models that truly possess problem-solving abilities on K-shot are selected rather than those blind guessers. Besides, during data selection, instructions that share task-relevant contexts with K-shot are prioritized. For the latter, we highlight the diversity of constituting experts and that of the fine-tuning instructions throughout the model and data selection process. Extensive experimental results confirm the superiority of our approach over existing methods on utilization of open knowledge across various tasks. Codes and models will be released later.
comment: 28 pages, 12 tables, 10 figures
☆ CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization
Recent advances in text-to-image personalization have enabled high-quality and controllable image synthesis for user-provided concepts. However, existing methods still struggle to balance identity preservation with text alignment. Our approach is based on the fact that generating prompt-aligned images requires a precise semantic understanding of the prompt, which involves accurately processing the interactions between the new concept and its surrounding context tokens within the CLIP text encoder. To address this, we aim to embed the new concept properly into the input embedding space of the text encoder, allowing for seamless integration with existing tokens. We introduce Context Regularization (CoRe), which enhances the learning of the new concept's text embedding by regularizing its context tokens in the prompt. This is based on the insight that appropriate output vectors of the text encoder for the context tokens can only be achieved if the new concept's text embedding is correctly learned. CoRe can be applied to arbitrary prompts without requiring the generation of corresponding images, thus improving the generalization of the learned text embedding. Additionally, CoRe can serve as a test-time optimization technique to further enhance the generations for specific prompts. Comprehensive experiments demonstrate that our method outperforms several baseline methods in both identity preservation and text alignment. Code will be made publicly available.
☆ Gen-Swarms: Adapting Deep Generative Models to Swarms of Drones
Gen-Swarms is an innovative method that leverages and combines the capabilities of deep generative models with reactive navigation algorithms to automate the creation of drone shows. Advancements in deep generative models, particularly diffusion models, have demonstrated remarkable effectiveness in generating high-quality 2D images. Building on this success, various works have extended diffusion models to 3D point cloud generation. In contrast, alternative generative models such as flow matching have been proposed, offering a simple and intuitive transition from noise to meaningful outputs. However, the application of flow matching models to 3D point cloud generation remains largely unexplored. Gen-Swarms adapts these models to automatically generate drone shows. Existing 3D point cloud generative models create point trajectories which are impractical for drone swarms. In contrast, our method not only generates accurate 3D shapes but also guides the swarm motion, producing smooth trajectories and accounting for potential collisions through a reactive navigation algorithm incorporated into the sampling process. For example, when given a text category like Airplane, Gen-Swarms can rapidly and continuously generate numerous variations of 3D airplane shapes. Our experiments demonstrate that this approach is particularly well-suited for drone shows, providing feasible trajectories, creating representative final shapes, and significantly enhancing the overall performance of drone show generation.
☆ Disentangled Diffusion Autoencoder for Harmonization of Multi-site Neuroimaging Data
Combining neuroimaging datasets from multiple sites and scanners can help increase statistical power and thus provide greater insight into subtle neuroanatomical effects. However, site-specific effects pose a challenge by potentially obscuring the biological signal and introducing unwanted variance. Existing harmonization techniques, which use statistical models to remove such effects, have been shown to incompletely remove site effects while also failing to preserve biological variability. More recently, generative models using GANs or autoencoder-based approaches, have been proposed for site adjustment. However, such methods are known for instability during training or blurry image generation. In recent years, diffusion models have become increasingly popular for their ability to generate high-quality synthetic images. In this work, we introduce the disentangled diffusion autoencoder (DDAE), a novel diffusion model designed for controlling specific aspects of an image. We apply the DDAE to the task of harmonizing MR images by generating high-quality site-adjusted images that preserve biological variability. We use data from 7 different sites and demonstrate the DDAE's superiority in generating high-resolution, harmonized 2D MR images over previous approaches. As far as we are aware, this work marks the first diffusion-based model for site adjustment of neuroimaging data.
☆ SpineMamba: Enhancing 3D Spinal Segmentation in Clinical Imaging through Residual Visual Mamba Layers and Shape Priors
Accurate segmentation of 3D clinical medical images is critical in the diagnosis and treatment of spinal diseases. However, the inherent complexity of spinal anatomy and uncertainty inherent in current imaging technologies, poses significant challenges for semantic segmentation of spinal images. Although convolutional neural networks (CNNs) and Transformer-based models have made some progress in spinal segmentation, their limitations in handling long-range dependencies hinder further improvements in segmentation accuracy.To address these challenges, we introduce a residual visual Mamba layer to effectively capture and model the deep semantic features and long-range spatial dependencies of 3D spinal data. To further enhance the structural semantic understanding of the vertebrae, we also propose a novel spinal shape prior module that captures specific anatomical information of the spine from medical images, significantly enhancing the model's ability to extract structural semantic information of the vertebrae. Comparative and ablation experiments on two datasets demonstrate that SpineMamba outperforms existing state-of-the-art models. On the CT dataset, the average Dice similarity coefficient for segmentation reaches as high as 94.40, while on the MR dataset, it reaches 86.95. Notably, compared to the renowned nnU-Net, SpineMamba achieves superior segmentation performance, exceeding it by up to 2 percentage points. This underscores its accuracy, robustness, and excellent generalization capabilities.
comment: 17 pages, 11 figures
☆ LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable the student model to emulate the teacher network's understanding. Following this, we introduce preference distillation via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples is significantly enhanced beyond l-MLLM, leading to a better student that surpasses its teacher, particularly in hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while maintaining a minimal number of activated parameters and low computational costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% trainable parameters. These results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code will be available on: https://github.com/shufangxun/LLaVA-MoD.
☆ Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations due to a lack of video context exploration. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module to instruct GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performances comparable to or even better than fully-supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
☆ GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model
Autonomous driving training requires a diverse range of datasets encompassing various traffic conditions, weather scenarios, and road types. Traditional data augmentation methods often struggle to generate datasets that represent rare occurrences. To address this challenge, we propose GenDDS, a novel approach for generating driving scenarios generation by leveraging the capabilities of Stable Diffusion XL (SDXL), an advanced latent diffusion model. Our methodology involves the use of descriptive prompts to guide the synthesis process, aimed at producing realistic and diverse driving scenarios. With the power of the latest computer vision techniques, such as ControlNet and Hotshot-XL, we have built a complete pipeline for video generation together with SDXL. We employ the KITTI dataset, which includes real-world driving videos, to train the model. Through a series of experiments, we demonstrate that our model can generate high-quality driving videos that closely replicate the complexity and variability of real-world driving scenarios. This research contributes to the development of sophisticated training data for autonomous driving systems and opens new avenues for creating virtual environments for simulation and validation purposes.
☆ microYOLO: Towards Single-Shot Object Detection on Microcontrollers ECML
This work-in-progress paper presents results on the feasibility of single-shot object detection on microcontrollers using YOLO. Single-shot object detectors like YOLO are widely used, however due to their complexity mainly on larger GPU-based platforms. We present microYOLO, which can be used on Cortex-M based microcontrollers, such as the OpenMV H7 R2, achieving about 3.5 FPS when classifying 128x128 RGB images while using less than 800 KB Flash and less than 350 KB RAM. Furthermore, we share experimental results for three different object detection tasks, analyzing the accuracy of microYOLO on them.
comment: Published at the ECML PKDD Conference 2023, at the 4th Workshop on IoT, Edge, and Mobile for Embedded Machine Learning
☆ What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector
This study presents a detailed analysis of the YOLOv8 object detection model, focusing on its architecture, training techniques, and performance improvements over previous iterations like YOLOv5. Key innovations, including the CSPNet backbone for enhanced feature extraction, the FPN+PAN neck for superior multi-scale object detection, and the transition to an anchor-free approach, are thoroughly examined. The paper reviews YOLOv8's performance across benchmarks like Microsoft COCO and Roboflow 100, highlighting its high accuracy and real-time capabilities across diverse hardware platforms. Additionally, the study explores YOLOv8's developer-friendly enhancements, such as its unified Python package and CLI, which streamline model training and deployment. Overall, this research positions YOLOv8 as a state-of-the-art solution in the evolving object detection field.
☆ Shot Segmentation Based on Von Neumann Entropy for Key Frame Extraction
Video key frame extraction is important in various fields, such as video summary, retrieval, and compression. Therefore, we suggest a video key frame extraction algorithm based on shot segmentation using Von Neumann entropy. The segmentation of shots is achieved through the computation of Von Neumann entropy of the similarity matrix among frames within the video sequence. The initial frame of each shot is selected as key frames, which combines the temporal sequence information of frames. The experimental results show the extracted key frames can fully and accurately represent the original video content while minimizing the number of repeated frames.
comment: 14 pages, 5 figures
☆ Network transferability of adversarial patches in real-time object detection
Adversarial patches in computer vision can be used, to fool deep neural networks and manipulate their decision-making process. One of the most prominent examples of adversarial patches are evasion attacks for object detectors. By covering parts of objects of interest, these patches suppress the detections and thus make the target object 'invisible' to the object detector. Since these patches are usually optimized on a specific network with a specific train dataset, the transferability across multiple networks and datasets is not given. This paper addresses these issues and investigates the transferability across numerous object detector architectures. Our extensive evaluation across various models on two distinct datasets indicates that patches optimized with larger models provide better network transferability than patches that are optimized with smaller models.
comment: 7 pages, 6 figures, 1 table
☆ SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization
Extreme Multimodal Summarization with Multimodal Output (XMSMO) becomes an attractive summarization approach by integrating various types of information to create extremely concise yet informative summaries for individual modalities. Existing methods overlook the issue that multimodal data often contains more topic irrelevant information, which can mislead the model into producing inaccurate summaries especially for extremely short ones. In this paper, we propose SITransformer, a \textbf{S}hared \textbf{I}nformation-guided \textbf{T}ransformer for extreme multimodal summarization. It has a shared information guided pipeline which involves a cross-modal shared information extractor and a cross-modal interaction module. The extractor formulates semantically shared salient information from different modalities by devising a novel filtering process consisting of a differentiable top-k selector and a shared-information guided gating unit. As a result, the common, salient, and relevant contents across modalities are identified. Next, a transformer with cross-modal attentions is developed for intra- and inter-modality learning with the shared information guidance to produce the extreme summary. Comprehensive experiments demonstrate that SITransformer significantly enhances the summarization quality for both video and text summaries for XMSMO. Our code will be publicly available at https://github.com/SichengLeoLiu/MMAsia24-XMSMO.
comment: 8 pages, 5 figures, submitted to ACM Multimedia Asia 2024
☆ Benchmarking foundation models as feature extractors for weakly-supervised computational pathology
Advancements in artificial intelligence have driven the development of numerous pathology foundation models capable of extracting clinically relevant information. However, there is currently limited literature independently evaluating these foundation models on truly external cohorts and clinically-relevant tasks to uncover adjustments for future improvements. In this study, we benchmarked ten histopathology foundation models on 13 patient cohorts with 6,791 patients and 9,493 slides from lung, colorectal, gastric, and breast cancers. The models were evaluated on weakly-supervised tasks related to biomarkers, morphological properties, and prognostic outcomes. We show that a vision-language foundation model, CONCH, yielded the highest performance in 42% of tasks when compared to vision-only foundation models. The experiments reveal that foundation models trained on distinct cohorts learn complementary features to predict the same label, and can be fused to outperform the current state of the art. Creating an ensemble of complementary foundation models outperformed CONCH in 66% of tasks. Moreover, our findings suggest that data diversity outweighs data volume for foundation models. Our work highlights actionable adjustments to improve pathology foundation models.
☆ Mining Field Data for Tree Species Recognition at Scale
Individual tree species labels are particularly hard to acquire due to the expert knowledge needed and the limitations of photointerpretation. Here, we present a methodology to automatically mine species labels from public forest inventory data, using available pretrained tree detection models. We identify tree instances in aerial imagery and match them with field data with close to zero human involvement. We conduct a series of experiments on the resulting dataset, and show a beneficial effect when adding noisy or even unlabeled data points, highlighting a strong potential for large-scale individual species mapping.
☆ DQFormer: Towards Unified LiDAR Panoptic Segmentation with Decoupled Queries
LiDAR panoptic segmentation, which jointly performs instance and semantic segmentation for things and stuff classes, plays a fundamental role in LiDAR perception tasks. While most existing methods explicitly separate these two segmentation tasks and utilize different branches (i.e., semantic and instance branches), some recent methods have embraced the query-based paradigm to unify LiDAR panoptic segmentation. However, the distinct spatial distribution and inherent characteristics of objects(things) and their surroundings(stuff) in 3D scenes lead to challenges, including the mutual competition of things/stuff and the ambiguity of classification/segmentation. In this paper, we propose decoupling things/stuff queries according to their intrinsic properties for individual decoding and disentangling classification/segmentation to mitigate ambiguity. To this end, we propose a novel framework dubbed DQFormer to implement semantic and instance segmentation in a unified workflow. Specifically, we design a decoupled query generator to propose informative queries with semantics by localizing things/stuff positions and fusing multi-level BEV embeddings. Moreover, a query-oriented mask decoder is introduced to decode corresponding segmentation masks by performing masked cross-attention between queries and mask embeddings. Finally, the decoded masks are combined with the semantics of the queries to produce panoptic results. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our DQFormer framework.
comment: 13 pages, 10 figures
☆ Multi-view Pose Fusion for Occlusion-Aware 3D Human Pose Estimation ECCV
Robust 3D human pose estimation is crucial to ensure safe and effective human-robot collaboration. Accurate human perception,however, is particularly challenging in these scenarios due to strong occlusions and limited camera viewpoints. Current 3D human pose estimation approaches are rather vulnerable in such conditions. In this work we present a novel approach for robust 3D human pose estimation in the context of human-robot collaboration. Instead of relying on noisy 2D features triangulation, we perform multi-view fusion on 3D skeletons provided by absolute monocular methods. Accurate 3D pose estimation is then obtained via reprojection error optimization, introducing limbs length symmetry constraints. We evaluate our approach on the public dataset Human3.6M and on a novel version Human3.6M-Occluded, derived adding synthetic occlusions on the camera views with the purpose of testing pose estimation algorithms under severe occlusions. We further validate our method on real human-robot collaboration workcells, in which we strongly surpass current 3D human pose estimation methods. Our approach outperforms state-of-the-art multi-view human pose estimation techniques and demonstrates superior capabilities in handling challenging scenarios with strong occlusions, representing a reliable and effective solution for real human-robot collaboration setups.
comment: ECCV workshops 2024
☆ Object Detection for Vehicle Dashcams using Transformers
The use of intelligent automation is growing significantly in the automotive industry, as it assists drivers and fleet management companies, thus increasing their productivity. Dash cams are now been used for this purpose which enables the instant identification and understanding of multiple objects and occurrences in the surroundings. In this paper, we propose a novel approach for object detection in dashcams using transformers. Our system is based on the state-of-the-art DEtection TRansformer (DETR), which has demonstrated strong performance in a variety of conditions, including different weather and illumination scenarios. The use of transformers allows for the consideration of contextual information in decisionmaking, improving the accuracy of object detection. To validate our approach, we have trained our DETR model on a dataset that represents real-world conditions. Our results show that the use of intelligent automation through transformers can significantly enhance the capabilities of dashcam systems. The model achieves an mAP of 0.95 on detection.
comment: 7 Pages, and 6 Figures
☆ Visual Prompt Engineering for Medical Vision Language Models in Radiology ECCV 2024
Medical image classification in radiology faces significant challenges, particularly in generalizing to unseen pathologies. In contrast, CLIP offers a promising solution by leveraging multimodal learning to improve zero-shot classification performance. However, in the medical domain, lesions can be small and might not be well represented in the embedding space. Therefore, in this paper, we explore the potential of visual prompt engineering to enhance the capabilities of Vision Language Models (VLMs) in radiology. Leveraging BiomedCLIP, trained on extensive biomedical image-text pairs, we investigate the impact of embedding visual markers directly within radiological images to guide the model's attention to critical regions. Our evaluation on the JSRT dataset, focusing on lung nodule malignancy classification, demonstrates that incorporating visual prompts $\unicode{x2013}$ such as arrows, circles, and contours $\unicode{x2013}$ significantly improves classification metrics including AUROC, AUPRC, F1 score, and accuracy. Moreover, the study provides attention maps, showcasing enhanced model interpretability and focus on clinically relevant areas. These findings underscore the efficacy of visual prompt engineering as a straightforward yet powerful approach to advance VLM performance in medical image analysis.
comment: Accepted at ECCV 2024 Workshop on Emergent Visual Abilities and Limits of Foundation Models
☆ A Survey on Facial Expression Recognition of Static and Dynamic Emotions
Facial expression recognition (FER) aims to analyze emotional states from static images and dynamic sequences, which is pivotal in enhancing anthropomorphic communication among humans, robots, and digital avatars by leveraging AI technologies. As the FER field evolves from controlled laboratory environments to more complex in-the-wild scenarios, advanced methods have been rapidly developed and new challenges and apporaches are encounted, which are not well addressed in existing reviews of FER. This paper offers a comprehensive survey of both image-based static FER (SFER) and video-based dynamic FER (DFER) methods, analyzing from model-oriented development to challenge-focused categorization. We begin with a critical comparison of recent reviews, an introduction to common datasets and evaluation criteria, and an in-depth workflow on FER to establish a robust research foundation. We then systematically review representative approaches addressing eight main challenges in SFER (such as expression disturbance, uncertainties, compound emotions, and cross-domain inconsistency) as well as seven main challenges in DFER (such as key frame sampling, expression intensity variations, and cross-modal alignment). Additionally, we analyze recent advancements, benchmark performances, major applications, and ethical considerations. Finally, we propose five promising future directions and development trends to guide ongoing research. The project page for this paper can be found at https://github.com/wangyanckxx/SurveyFER.
☆ A Survey on Evaluation of Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) mimic human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning LLMs as the "brain" and various modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities, and suggests a potential pathway towards achieving artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) "what to evaluate" that reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including general multimodal recognition, perception, reasoning and trustworthiness, and domain-specific applications such as socioeconomic, natural sciences and engineering, medical usage, AI agent, remote sensing, video and audio processing, 3D point cloud analysis, and others; (3) "where to evaluate" that summarizes MLLM evaluation benchmarks into general and specific benchmarks; (4) "how to evaluate" that reviews and illustrates MLLM evaluation steps and metrics; Our overarching goal is to provide valuable insights for researchers in the field of MLLM evaluation, thereby facilitating the development of more capable and reliable MLLMs. We emphasize that evaluation should be regarded as a critical discipline, essential for advancing the field of MLLMs.
☆ Addressing the challenges of loop detection in agricultural environments
While visual SLAM systems are well studied and achieve impressive results in indoor and urban settings, natural, outdoor and open-field environments are much less explored and still present relevant research challenges. Visual navigation and local mapping have shown a relatively good performance in open-field environments. However, globally consistent mapping and long-term localization still depend on the robustness of loop detection and closure, for which the literature is scarce. In this work we propose a novel method to pave the way towards robust loop detection in open fields, particularly in agricultural settings, based on local feature search and stereo geometric refinement, with a final stage of relative pose estimation. Our method consistently achieves good loop detections, with a median error of 15cm. We aim to characterize open fields as a novel environment for loop detection, understanding the limitations and problems that arise when dealing with them.
☆ Str-L Pose: Integrating Point and Structured Line for Relative Pose Estimation in Dual-Graph
Relative pose estimation is crucial for various computer vision applications, including Robotic and Autonomous Driving. Current methods primarily depend on selecting and matching feature points prone to incorrect matches, leading to poor performance. Consequently, relying solely on point-matching relationships for pose estimation is a huge challenge. To overcome these limitations, we propose a Geometric Correspondence Graph neural network that integrates point features with extra structured line segments. This integration of matched points and line segments further exploits the geometry constraints and enhances model performance across different environments. We employ the Dual-Graph module and Feature Weighted Fusion Module to aggregate geometric and visual features effectively, facilitating complex scene understanding. We demonstrate our approach through extensive experiments on the DeMoN and KITTI Odometry datasets. The results show that our method is competitive with state-of-the-art techniques.
☆ Segmentation-guided Layer-wise Image Vectorization with Gradient Fills
The widespread use of vector graphics creates a significant demand for vectorization methods. While recent learning-based techniques have shown their capability to create vector images of clear topology, filling these primitives with gradients remains a challenge. In this paper, we propose a segmentation-guided vectorization framework to convert raster images into concise vector graphics with radial gradient fills. With the guidance of an embedded gradient-aware segmentation subroutine, our approach progressively appends gradient-filled B\'ezier paths to the output, where primitive parameters are initiated with our newly designed initialization technique and are optimized to minimize our novel loss function. We build our method on a differentiable renderer with traditional segmentation algorithms to develop it as a model-free tool for raster-to-vector conversion. It is tested on various inputs to demonstrate its feasibility, independent of datasets, to synthesize vector graphics with improved visual quality and layer-wise topology compared to prior work.
☆ MambaPlace:Text-to-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms
Vision Language Place Recognition (VLVPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLVPR directs robot place matching, overcoming the constraint of solely depending on vision. The essence of multimodal fusion lies in mining the complementary information between different modalities. However, general fusion methods rely on traditional neural architectures and are not well equipped to capture the dynamics of cross modal interactions, especially in the presence of complex intra modal and inter modal correlations. To this end, this paper proposes a novel coarse to fine and end to end connected cross modal place recognition framework, called MambaPlace. In the coarse localization stage, the text description and 3D point cloud are encoded by the pretrained T5 and instance encoder, respectively. They are then processed using Text Attention Mamba (TAM) and Point Clouds Mamba (PCM) for data enhancement and alignment. In the subsequent fine localization stage, the features of the text description and 3D point cloud are cross modally fused and further enhanced through cascaded Cross Attention Mamba (CCAM). Finally, we predict the positional offset from the fused text point cloud features, achieving the most accurate localization. Extensive experiments show that MambaPlace achieves improved localization accuracy on the KITTI360Pose dataset compared to the state of the art methods.
comment: 8 pages
☆ Defending Text-to-image Diffusion Models: Surprising Efficacy of Textual Perturbations Against Backdoor Attacks ECCV 2024
Text-to-image diffusion models have been widely adopted in real-world applications due to their ability to generate realistic images from textual descriptions. However, recent studies have shown that these methods are vulnerable to backdoor attacks. Despite the significant threat posed by backdoor attacks on text-to-image diffusion models, countermeasures remain under-explored. In this paper, we address this research gap by demonstrating that state-of-the-art backdoor attacks against text-to-image diffusion models can be effectively mitigated by a surprisingly simple defense strategy - textual perturbation. Experiments show that textual perturbations are effective in defending against state-of-the-art backdoor attacks with minimal sacrifice to generation quality. We analyze the efficacy of textual perturbation from two angles: text embedding space and cross-attention maps. They further explain how backdoor attacks have compromised text-to-image diffusion models, providing insights for studying future attack and defense strategies. Our code is available at https://github.com/oscarchew/t2i-backdoor-defense.
comment: ECCV 2024 Workshop The Dark Side of Generative AIs and Beyond
☆ Pixels to Prose: Understanding the art of Image Captioning
In the era of evolving artificial intelligence, machines are increasingly emulating human-like capabilities, including visual perception and linguistic expression. Image captioning stands at the intersection of these domains, enabling machines to interpret visual content and generate descriptive text. This paper provides a thorough review of image captioning techniques, catering to individuals entering the field of machine learning who seek a comprehensive understanding of available options, from foundational methods to state-of-the-art approaches. Beginning with an exploration of primitive architectures, the review traces the evolution of image captioning models to the latest cutting-edge solutions. By dissecting the components of these architectures, readers gain insights into the underlying mechanisms and can select suitable approaches tailored to specific problem requirements without duplicating efforts. The paper also delves into the application of image captioning in the medical domain, illuminating its significance in various real-world scenarios. Furthermore, the review offers guidance on evaluating the performance of image captioning systems, highlighting key metrics for assessment. By synthesizing theoretical concepts with practical application, this paper equips readers with the knowledge needed to navigate the complex landscape of image captioning and harness its potential for diverse applications in machine learning and beyond.
☆ Towards Realistic Example-based Modeling via 3D Gaussian Stitching
Using parts of existing models to rebuild new models, commonly termed as example-based modeling, is a classical methodology in the realm of computer graphics. Previous works mostly focus on shape composition, making them very hard to use for realistic composition of 3D objects captured from real-world scenes. This leads to combining multiple NeRFs into a single 3D scene to achieve seamless appearance blending. However, the current SeamlessNeRF method struggles to achieve interactive editing and harmonious stitching for real-world scenes due to its gradient-based strategy and grid-based representation. To this end, we present an example-based modeling method that combines multiple Gaussian fields in a point-based representation using sample-guided synthesis. Specifically, as for composition, we create a GUI to segment and transform multiple fields in real time, easily obtaining a semantically meaningful composition of models represented by 3D Gaussian Splatting (3DGS). For texture blending, due to the discrete and irregular nature of 3DGS, straightforwardly applying gradient propagation as SeamlssNeRF is not supported. Thus, a novel sampling-based cloning method is proposed to harmonize the blending while preserving the original rich texture and content. Our workflow consists of three steps: 1) real-time segmentation and transformation of a Gaussian model using a well-tailored GUI, 2) KNN analysis to identify boundary points in the intersecting area between the source and target models, and 3) two-phase optimization of the target model using sampling-based cloning and gradient constraints. Extensive experimental results validate that our approach significantly outperforms previous works in terms of realistic synthesis, demonstrating its practicality. More demos are available at https://ingra14m.github.io/gs_stitching_website.
☆ G-Style: Stylized Gaussian Splatting
We introduce G-Style, a novel algorithm designed to transfer the style of an image onto a 3D scene represented using Gaussian Splatting. Gaussian Splatting is a powerful 3D representation for novel view synthesis, as -- compared to other approaches based on Neural Radiance Fields -- it provides fast scene renderings and user control over the scene. Recent pre-prints have demonstrated that the style of Gaussian Splatting scenes can be modified using an image exemplar. However, since the scene geometry remains fixed during the stylization process, current solutions fall short of producing satisfactory results. Our algorithm aims to address these limitations by following a three-step process: In a pre-processing step, we remove undesirable Gaussians with large projection areas or highly elongated shapes. Subsequently, we combine several losses carefully designed to preserve different scales of the style in the image, while maintaining as much as possible the integrity of the original scene content. During the stylization process and following the original design of Gaussian Splatting, we split Gaussians where additional detail is necessary within our scene by tracking the gradient of the stylized color. Our experiments demonstrate that G-Style generates high-quality stylizations within just a few minutes, outperforming existing methods both qualitatively and quantitatively.
☆ Synthetic Forehead-creases Biometric Generation for Reliable User Verification
Recent studies have emphasized the potential of forehead-crease patterns as an alternative for face, iris, and periocular recognition, presenting contactless and convenient solutions, particularly in situations where faces are covered by surgical masks. However, collecting forehead data presents challenges, including cost and time constraints, as developing and optimizing forehead verification methods requires a substantial number of high-quality images. To tackle these challenges, the generation of synthetic biometric data has gained traction due to its ability to protect privacy while enabling effective training of deep learning-based biometric verification methods. In this paper, we present a new framework to synthesize forehead-crease image data while maintaining important features, such as uniqueness and realism. The proposed framework consists of two main modules: a Subject-Specific Generation Module (SSGM), based on an image-to-image Brownian Bridge Diffusion Model (BBDM), which learns a one-to-many mapping between image pairs to generate identity-aware synthetic forehead creases corresponding to real subjects, and a Subject-Agnostic Generation Module (SAGM), which samples new synthetic identities with assistance from the SSGM. We evaluate the diversity and realism of the generated forehead-crease images primarily using the Fr\'echet Inception Distance (FID) and the Structural Similarity Index Measure (SSIM). In addition, we assess the utility of synthetically generated forehead-crease images using a forehead-crease verification system (FHCVS). The results indicate an improvement in the verification accuracy of the FHCVS by utilizing synthetic data.
comment: Accepted at Generative AI for Futuristic Biometrics - IJCB'24 Special Session
☆ A quantitative model of takeover request time budget for conditionally automated driving
In conditional automation, the automated driving system assumes full control and only issues a takeover request to a human driver to resume driving in critical situations. Previous studies have concluded that the time budget required by drivers to resume driving after a takeover request varies with situations and different takeover variables. However, no comprehensive generalized approaches for estimating in advance the time budget required by drivers to takeover have been provided. In this contribution, fixed (7 s) and variable time budgets (6 s, 5 s, and 4 s) with and without visual imagery assistance were investigated for suitability in three takeover scenarios using performance measures such as average lateral displacement. The results indicate that 7 s is suitable for two of the studied scenarios based on their characteristics. Using the obtained results and known relations between takeover variables, a mathematical formula for estimating takeover request time budget is proposed. The proposed formula integrates individual stimulus response time, driving experience, scenario specific requirements and allows increased safety for takeover maneuvers. Furthermore, the visual imagery resulted in increased takeover time which invariably increases the time budget. Thus the time demand of the visualized information if applicable (such as visual imagery) should be included in the time budget.
comment: Manuscript: 12 pages, 12 figures, 7 tables
☆ DEAR: Depth-Enhanced Action Recognition ECCV
Detecting actions in videos, particularly within cluttered scenes, poses significant challenges due to the limitations of 2D frame analysis from a camera perspective. Unlike human vision, which benefits from 3D understanding, recognizing actions in such environments can be difficult. This research introduces a novel approach integrating 3D features and depth maps alongside RGB features to enhance action recognition accuracy. Our method involves processing estimated depth maps through a separate branch from the RGB feature encoder and fusing the features to understand the scene and actions comprehensively. Using the Side4Video framework and VideoMamba, which employ CLIP and VisionMamba for spatial feature extraction, our approach outperformed our implementation of the Side4Video network on the Something-Something V2 dataset. Our code is available at: https://github.com/SadeghRahmaniB/DEAR
comment: 5 pages, 1 figure, 1 table, accepted at Human-inspired Computer Vision, ECCV
☆ Deep Learning Based Speckle Filtering for Polarimetric SAR Images. Application to Sentinel-1
Speckle suppression in synthetic aperture radar (SAR) images is a key processing step which continues to be a research topic. A wide variety of methods, using either spatially-based approaches or transform-based strategies, have been developed and have shown to provide outstanding results. However, recent advances in deep learning techniques and their application to SAR image despeckling have been demonstrated to offer state-of-the-art results. Unfortunately, they have been mostly applied to single-polarimetric images. The extension of a deep learning-based approach for speckle removal to polarimetric SAR (PolSAR) images is complicated because of the complex nature of the measured covariance matrices for every image pixel, the properties of which must be preserved during filtering. In this work, we propose a complete framework to remove speckle in polarimetric SAR images using a convolutional neural network. The methodology includes a reversible transformation of the original complex covariance matrix to obtain a set of real-valued intensity bands which are fed to the neural network. In addition, the proposed method includes a change detection strategy to avoid the neural network to learn erroneous features in areas strongly affected by temporal changes, so that the network only learns the underlying speckle component present in the data. The method is implemented and tested with dual-polarimetric images acquired by Sentinel-1. Experiments show that the proposed approach offers exceptional results in both speckle reduction and resolution preservation. More importantly, it is also shown that the neural network is not generating artifacts or introducing bias in the filtered images, making them suitable for further polarimetric processing and exploitation.
comment: 23 pages, 32 figures
☆ Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers
Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data on scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach outperforms prior arts consistently on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.
☆ Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas ECCV 2024
Diffusion models have become the State-of-the-Art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which has been tackled in recent works by combining independent diffusion paths over overlapping latent features, which is referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade-off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantical coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrate that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.
comment: Accepted at ECCV 2024
☆ TeFF: Tracking-enhanced Forgetting-free Few-shot 3D LiDAR Semantic Segmentation
In autonomous driving, 3D LiDAR plays a crucial role in understanding the vehicle's surroundings. However, the newly emerged, unannotated objects presents few-shot learning problem for semantic segmentation. This paper addresses the limitations of current few-shot semantic segmentation by exploiting the temporal continuity of LiDAR data. Employing a tracking model to generate pseudo-ground-truths from a sequence of LiDAR frames, our method significantly augments the dataset, enhancing the model's ability to learn on novel classes. However, this approach introduces a data imbalance biased to novel data that presents a new challenge of catastrophic forgetting. To mitigate this, we incorporate LoRA, a technique that reduces the number of trainable parameters, thereby preserving the model's performance on base classes while improving its adaptability to novel classes. This work represents a significant step forward in few-shot 3D LiDAR semantic segmentation for autonomous driving. Our code is available at https://github.com/junbao-zhou/Track-no-forgetting.
☆ Realigned Softmax Warping for Deep Metric Learning
Deep Metric Learning (DML) loss functions traditionally aim to control the forces of separability and compactness within an embedding space so that the same class data points are pulled together and different class ones are pushed apart. Within the context of DML, a softmax operation will typically normalize distances into a probability for optimization, thus coupling all the push/pull forces together. This paper proposes a potential new class of loss functions that operate within a euclidean domain and aim to take full advantage of the coupled forces governing embedding space formation under a softmax. These forces of compactness and separability can be boosted or mitigated within controlled locations at will by using a warping function. In this work, we provide a simple example of a warping function and use it to achieve competitive, state-of-the-art results on various metric learning benchmarks.
comment: Preprint
☆ Online pre-training with long-form videos
In this study, we investigate the impact of online pre-training with continuous video clips. We will examine three methods for pre-training (masked image modeling, contrastive learning, and knowledge distillation), and assess the performance on downstream action recognition tasks. As a result, online pre-training with contrast learning showed the highest performance in downstream tasks. Our findings suggest that learning from long-form videos can be helpful for action recognition with short videos.
comment: GCCE2024
☆ Leveraging Persistent Homology for Differential Diagnosis of Mild Cognitive Impairment
Mild cognitive impairment (MCI) is characterized by subtle changes in cognitive functions, often associated with disruptions in brain connectivity. The present study introduces a novel fine-grained analysis to examine topological alterations in neurodegeneration pertaining to six different brain networks of MCI subjects (Early/Late MCI). To achieve this, fMRI time series from two distinct populations are investigated: (i) the publicly accessible ADNI dataset and (ii) our in-house dataset. The study utilizes sliding window embedding to convert each fMRI time series into a sequence of 3-dimensional vectors, facilitating the assessment of changes in regional brain topology. Distinct persistence diagrams are computed for Betti descriptors of dimension-0, 1, and 2. Wasserstein distance metric is used to quantify differences in topological characteristics. We have examined both (i) ROI-specific inter-subject interactions and (ii) subject-specific inter-ROI interactions. Further, a new deep learning model is proposed for classification, achieving a maximum classification accuracy of 95% for the ADNI dataset and 85% for the in-house dataset. This methodology is further adapted for the differential diagnosis of MCI sub-types, resulting in a peak accuracy of 76.5%, 91.1% and 80% in classifying HC Vs. EMCI, HC Vs. LMCI and EMCI Vs. LMCI, respectively. We showed that the proposed approach surpasses current state-of-the-art techniques designed for classifying MCI and its sub-types using fMRI.
comment: 16 pages, 6 figures, 3 tables, accepted at International Conference on Pattern Recognition 2024
☆ μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context ECCV
Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many social and humanities fields. In this work, we focus on Regesta Pontificum Romanum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documental sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually markup language. However, current models focus on scientific and business documents, and most of them consider only single-paged documents. To overcome this limitation, in this work, we propose {\mu}gat, an extension of the recently proposed Document parsing Nougat architecture, which can handle elements spanning over the single page limits. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach also in the case of the challenging Regesta Pontificum Romanorum.
comment: Accepted at ECCV Workshop "AI4DH: Artificial Intelligence for Digital Humanities"
☆ RIDE: Boosting 3D Object Detection for LiDAR Point Clouds via Rotation-Invariant Analysis
The rotation robustness property has drawn much attention to point cloud analysis, whereas it still poses a critical challenge in 3D object detection. When subjected to arbitrary rotation, most existing detectors fail to produce expected outputs due to the poor rotation robustness. In this paper, we present RIDE, a pioneering exploration of Rotation-Invariance for the 3D LiDAR-point-based object DEtector, with the key idea of designing rotation-invariant features from LiDAR scenes and then effectively incorporating them into existing 3D detectors. Specifically, we design a bi-feature extractor that extracts (i) object-aware features though sensitive to rotation but preserve geometry well, and (ii) rotation-invariant features, which lose geometric information to a certain extent but are robust to rotation. These two kinds of features complement each other to decode 3D proposals that are robust to arbitrary rotations. Particularly, our RIDE is compatible and easy to plug into the existing one-stage and two-stage 3D detectors, and boosts both detection performance and rotation robustness. Extensive experiments on the standard benchmarks showcase that the mean average precision (mAP) and rotation robustness can be significantly boosted by integrating with our RIDE, with +5.6% mAP and 53% rotation robustness improvement on KITTI, +5.1% and 28% improvement correspondingly on nuScenes. The code will be available soon.
☆ Can SAR improve RSVQA performance?
Remote sensing visual question answering (RSVQA) has been involved in several research in recent years, leading to an increase in new methods. RSVQA automatically extracts information from satellite images, so far only optical, and a question to automatically search for the answer in the image and provide it in a textual form. In our research, we study whether Synthetic Aperture Radar (SAR) images can be beneficial to this field. We divide our study into three phases which include classification methods and VQA. In the first one, we explore the classification results of SAR alone and investigate the best method to extract information from SAR data. Then, we study the combination of SAR and optical data. In the last phase, we investigate how SAR images and a combination of different modalities behave in RSVQA compared to a method only using optical images. We conclude that adding the SAR modality leads to improved performances, although further research on using SAR data to automatically answer questions is needed as well as more balanced datasets.
comment: 6 pages, 4 figures
☆ MMDRFuse: Distilled Mini-Model with Dynamic Refresh for Multi-Modality Image Fusion
In recent years, Multi-Modality Image Fusion (MMIF) has been applied to many fields, which has attracted many scholars to endeavour to improve the fusion performance. However, the prevailing focus has predominantly been on the architecture design, rather than the training strategies. As a low-level vision task, image fusion is supposed to quickly deliver output images for observation and supporting downstream tasks. Thus, superfluous computational and storage overheads should be avoided. In this work, a lightweight Distilled Mini-Model with a Dynamic Refresh strategy (MMDRFuse) is proposed to achieve this objective. To pursue model parsimony, an extremely small convolutional network with a total of 113 trainable parameters (0.44 KB) is obtained by three carefully designed supervisions. First, digestible distillation is constructed by emphasising external spatial feature consistency, delivering soft supervision with balanced details and saliency for the target network. Second, we develop a comprehensive loss to balance the pixel, gradient, and perception clues from the source images. Third, an innovative dynamic refresh training strategy is used to collaborate history parameters and current supervision during training, together with an adaptive adjust function to optimise the fusion network. Extensive experiments on several public datasets demonstrate that our method exhibits promising advantages in terms of model efficiency and complexity, with superior performance in multiple image fusion tasks and downstream pedestrian detection application. The code of this work is publicly available at https://github.com/yanglinDeng/MMDRFuse.
comment: 10 pages, 8 figures, accpeted by ACM International Conference on Multimedia 2024(Oral)
☆ Transfer Learning from Simulated to Real Scenes for Monocular 3D Object Detection ECCV'24
Accurately detecting 3D objects from monocular images in dynamic roadside scenarios remains a challenging problem due to varying camera perspectives and unpredictable scene conditions. This paper introduces a two-stage training strategy to address these challenges. Our approach initially trains a model on the large-scale synthetic dataset, RoadSense3D, which offers a diverse range of scenarios for robust feature learning. Subsequently, we fine-tune the model on a combination of real-world datasets to enhance its adaptability to practical conditions. Experimental results of the Cube R-CNN model on challenging public benchmarks show a remarkable improvement in detection performance, with a mean average precision rising from 0.26 to 12.76 on the TUM Traffic A9 Highway dataset and from 2.09 to 6.60 on the DAIR-V2X-I dataset when performing transfer learning. Code, data, and qualitative video results are available on the project website: https://roadsense3d.github.io.
comment: 18 pages. Accepted for ECVA European Conference on Computer Vision 2024 (ECCV'24)
☆ CSAD: Unsupervised Component Segmentation for Logical Anomaly Detection
To improve logical anomaly detection, some previous works have integrated segmentation techniques with conventional anomaly detection methods. Although these methods are effective, they frequently lead to unsatisfactory segmentation results and require manual annotations. To address these drawbacks, we develop an unsupervised component segmentation technique that leverages foundation models to autonomously generate training labels for a lightweight segmentation network without human labeling. Integrating this new segmentation technique with our proposed Patch Histogram module and the Local-Global Student-Teacher (LGST) module, we achieve a detection AUROC of 95.3% in the MVTec LOCO AD dataset, which surpasses previous SOTA methods. Furthermore, our proposed method provides lower latency and higher throughput than most existing approaches.
☆ Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail
Most production-level deployments for Visual Question Answering (VQA) tasks are still build as processing pipelines of independent steps including image pre-processing, object- and text detection, Optical Character Recognition (OCR) and (mostly supervised) object classification. However, the recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question if these custom trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs in the context of VQA and OCR [5, 9, 12] tasks in a production-level scenario. Using data from the Retail-786k [10] dataset, we investigate the capabilities of pre-trained VLMs to answer detailed questions about advertised products in images. Our study includes two commercial models, GPT-4V [16] and GPT-4o [17], as well as four open-source models: InternVL [5], LLaVA 1.5 [12], LLaVA-NeXT [13], and CogAgent [9]. Our initial results show, that there is in general no big performance gap between open-source and commercial models. However, we observe a strong task dependent variance in VLM performance: while most models are able to answer questions regarding the product brand and price with high accuracy, they completely fail at the same time to correctly identity the specific product name or discount. This indicates the problem of VLMs to solve fine-grained classification tasks as well to model the more abstract concept of discounts.
☆ Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction ICCV2023
In addition to color and textural information, geometry provides important cues for 3D scene reconstruction. However, current reconstruction methods only include geometry at the feature level thus not fully exploiting the geometric information. In contrast, this paper proposes a novel geometry integration mechanism for 3D scene reconstruction. Our approach incorporates 3D geometry at three levels, i.e. feature learning, feature fusion, and network supervision. First, geometry-guided feature learning encodes geometric priors to contain view-dependent information. Second, a geometry-guided adaptive feature fusion is introduced which utilizes the geometric priors as a guidance to adaptively generate weights for multiple views. Third, at the supervision level, taking the consistency between 2D and 3D normals into account, a consistent 3D normal loss is designed to add local constraints. Large-scale experiments are conducted on the ScanNet dataset, showing that volumetric methods with our geometry integration mechanism outperform state-of-the-art methods quantitatively as well as qualitatively. Volumetric methods with ours also show good generalization on the 7-Scenes and TUM RGB-D datasets.
comment: Accepted by ICCV2023
☆ ES-PTAM: Event-based Stereo Parallel Tracking and Mapping
Visual Odometry (VO) and SLAM are fundamental components for spatial perception in mobile robots. Despite enormous progress in the field, current VO/SLAM systems are limited by their sensors' capability. Event cameras are novel visual sensors that offer advantages to overcome the limitations of standard cameras, enabling robots to expand their operating range to challenging scenarios, such as high-speed motion and high dynamic range illumination. We propose a novel event-based stereo VO system by combining two ideas: a correspondence-free mapping module that estimates depth by maximizing ray density fusion and a tracking module that estimates camera poses by maximizing edge-map alignment. We evaluate the system comprehensively on five real-world datasets, spanning a variety of camera types (manufacturers and spatial resolutions) and scenarios (driving, flying drone, hand-held, egocentric, etc). The quantitative and qualitative results demonstrate that our method outperforms the state of the art in majority of the test sequences by a margin, e.g., trajectory error reduction of 45% on RPG dataset, 61% on DSEC dataset, and 21% on TUM-VIE dataset. To benefit the community and foster research on event-based perception systems, we release the source code and results: https://github.com/tub-rip/ES-PTAM
comment: 17 pages, 7 figures, 4 tables, https://github.com/tub-rip/ES-PTAM
☆ On the Benefits of Visual Stabilization for Frame- and Event-based Perception
Vision-based perception systems are typically exposed to large orientation changes in different robot applications. In such conditions, their performance might be compromised due to the inherent complexity of processing data captured under challenging motion. Integration of mechanical stabilizers to compensate for the camera rotation is not always possible due to the robot payload constraints. This paper presents a processing-based stabilization approach to compensate the camera's rotational motion both on events and on frames (i.e., images). Assuming that the camera's attitude is available, we evaluate the benefits of stabilization in two perception applications: feature tracking and estimating the translation component of the camera's ego-motion. The validation is performed using synthetic data and sequences from well-known event-based vision datasets. The experiments unveil that stabilization can improve feature tracking and camera ego-motion estimation accuracy in 27.37% and 34.82%, respectively. Concurrently, stabilization can reduce the processing time of computing the camera's linear velocity by at least 25%. Code is available at https://github.com/tub-rip/visual_stabilization
comment: 8 pages, 4 figures, 4 tables, https://github.com/tub-rip/visual_stabilization
☆ Hierarchical Visual Categories Modeling: A Joint Representation Learning and Density Estimation Framework for Out-of-Distribution Detection ICCV2023
Detecting out-of-distribution inputs for visual recognition models has become critical in safe deep learning. This paper proposes a novel hierarchical visual category modeling scheme to separate out-of-distribution data from in-distribution data through joint representation learning and statistical modeling. We learn a mixture of Gaussian models for each in-distribution category. There are many Gaussian mixture models to model different visual categories. With these Gaussian models, we design an in-distribution score function by aggregating multiple Mahalanobis-based metrics. We don't use any auxiliary outlier data as training samples, which may hurt the generalization ability of out-of-distribution detection algorithms. We split the ImageNet-1k dataset into ten folds randomly. We use one fold as the in-distribution dataset and the others as out-of-distribution datasets to evaluate the proposed method. We also conduct experiments on seven popular benchmarks, including CIFAR, iNaturalist, SUN, Places, Textures, ImageNet-O, and OpenImage-O. Extensive experiments indicate that the proposed method outperforms state-of-the-art algorithms clearly. Meanwhile, we find that our visual representation has a competitive performance when compared with features learned by classical methods. These results demonstrate that the proposed method hasn't weakened the discriminative ability of visual recognition models and keeps high efficiency in detecting out-of-distribution samples.
comment: Accepted by ICCV2023
☆ Temporal Attention for Cross-View Sequential Image Localization IROS 2024
This paper introduces a novel approach to enhancing cross-view localization, focusing on the fine-grained, sequential localization of street-view images within a single known satellite image patch, a significant departure from traditional one-to-one image retrieval methods. By expanding to sequential image fine-grained localization, our model, equipped with a novel Temporal Attention Module (TAM), leverages contextual information to significantly improve sequential image localization accuracy. Our method shows substantial reductions in both mean and median localization errors on the Cross-View Image Sequence (CVIS) dataset, outperforming current state-of-the-art single-image localization techniques. Additionally, by adapting the KITTI-CVL dataset into sequential image sets, we not only offer a more realistic dataset for future research but also demonstrate our model's robust generalization capabilities across varying times and areas, evidenced by a 75.3% reduction in mean distance error in cross-view sequential image localization.
comment: Accepted to IROS 2024
☆ TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning
Multimodal fusion, leveraging data like vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose \textbf{TagOOD}, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of IND object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks.
comment: Accepted by ACMMM2024
☆ Generalization Capabilities of Neural Cellular Automata for Medical Image Segmentation: A Robust and Lightweight Approach
In the field of medical imaging, the U-Net architecture, along with its variants, has established itself as a cornerstone for image segmentation tasks, particularly due to its strong performance when trained on limited datasets. Despite its impressive performance on identically distributed (in-domain) data, U-Nets exhibit a significant decline in performance when tested on data that deviates from the training distribution, out-of-distribution (out-of-domain) data. Current methodologies predominantly address this issue by employing generalization techniques that hinge on various forms of regularization, which have demonstrated moderate success in specific scenarios. This paper, however, ventures into uncharted territory by investigating the implications of utilizing models that are smaller by three orders of magnitude (i.e., x1000) compared to a conventional U-Net. A reduction of this size in U-net parameters typically adversely affects both in-domain and out-of-domain performance, possibly due to a significantly reduced receptive field. To circumvent this issue, we explore the concept of Neural Cellular Automata (NCA), which, despite its simpler model structure, can attain larger receptive fields through recursive processes. Experimental results on two distinct datasets reveal that NCA outperforms traditional methods in terms of generalization, while still maintaining a commendable IID performance.
☆ Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
Multimodal large language models (MLLMs) have experienced significant advancements recently, but still struggle to recognize and interpret intricate details in high-resolution (HR) images effectively. While state-of-the-art (SOTA) MLLMs claim to process images at 4K resolution, existing MLLM benchmarks only support up to 2K, leaving the capabilities of SOTA models on true HR images largely untested. Furthermore, existing methods for enhancing HR image perception in MLLMs rely on computationally expensive visual instruction tuning. To address these limitations, we introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K&8K images. Through extensive experiments, we demonstrate that while downsampling HR images leads to vision information loss, leveraging complementary modalities, e.g., text, can effectively compensate for this loss. Building upon this insight, we propose Divide, Conquer and Combine (DC$^2$), a novel training-free framework for enhancing MLLM perception of HR images. DC$^2$ follows a three-staged approach: 1) Divide: recursively partitioning the HR image into patches and merging similar patches to minimize computational overhead, 2) Conquer: leveraging the MLLM to generate accurate textual descriptions for each image patch, and 3) Combine: utilizing the generated text descriptions to enhance the MLLM's understanding of the overall HR image. Extensive experiments show that: 1) the SOTA MLLM achieves 63% accuracy, which is markedly lower than the 87% accuracy achieved by humans on HR-Bench; 2) our DC$^2$ brings consistent and significant improvements (a relative increase of +6% on HR-Bench and +8% on general multimodal benchmarks). The benchmark and code will be released to facilitate the multimodal R&D community.
☆ Latent Relationship Mining of Glaucoma Biomarkers: a TRI-LSTM based Deep Learning
In recently years, a significant amount of research has been conducted on applying deep learning methods for glaucoma classification and detection. However, the explainability of those established machine learning models remains a big concern. In this research, in contrast, we learn from cognitive science concept and study how ophthalmologists judge glaucoma detection. Simulating experts' efforts, we propose a hierarchical decision making system, centered around a holistic set of carefully designed biomarker-oriented machine learning models. While biomarkers represent the key indicators of how ophthalmologists identify glaucoma, they usually exhibit latent inter-relations. We thus construct a time series model, named TRI-LSTM, capable of calculating and uncovering potential and latent relationships among various biomarkers of glaucoma. Our model is among the first efforts to explore the intrinsic connections among glaucoma biomarkers. We monitor temporal relationships in patients' disease states over time and to capture and retain the progression of disease-relevant clinical information from prior visits, thereby enriching biomarker's potential relationships. Extensive experiments over real-world dataset have demonstrated the effectiveness of the proposed model.
comment: 9 pages, 4 images
☆ ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame. Existed MOT methods excel at accurately tracking multiple objects in real-time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose a novel ConsistencyTrack, joint detection and tracking(JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model's noise resistance. During the training phase, paired object boxes within two adjacent frames are diffused from ground-truth boxes to a random distribution, and then the model learns to detect and track by reversing this process. In inference, the model refines randomly generated boxes into detection and tracking results through minimal denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms other compared methods, especially better than DiffusionTrack in inference speed and other performance metrics. Our code is available at https://github.com/Tankowa/ConsistencyTrack.
comment: arXiv admin note: text overlap with arXiv:2308.09905 by other authors
☆ Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. Particularly, on benchmarks specialized for long videos, Kangaroo excels some larger models with over 10B parameters and proprietary models.
☆ Ray-Distance Volume Rendering for Neural Scene Reconstruction ECCV2024
Existing methods in neural scene reconstruction utilize the Signed Distance Function (SDF) to model the density function. However, in indoor scenes, the density computed from the SDF for a sampled point may not consistently reflect its real importance in volume rendering, often due to the influence of neighboring objects. To tackle this issue, our work proposes a novel approach for indoor scene reconstruction, which instead parameterizes the density function with the Signed Ray Distance Function (SRDF). Firstly, the SRDF is predicted by the network and transformed to a ray-conditioned density function for volume rendering. We argue that the ray-specific SRDF only considers the surface along the camera ray, from which the derived density function is more consistent to the real occupancy than that from the SDF. Secondly, although SRDF and SDF represent different aspects of scene geometries, their values should share the same sign indicating the underlying spatial occupancy. Therefore, this work introduces a SRDF-SDF consistency loss to constrain the signs of the SRDF and SDF outputs. Thirdly, this work proposes a self-supervised visibility task, introducing the physical visibility geometry to the reconstruction task. The visibility task combines prior from predicted SRDF and SDF as pseudo labels, and contributes to generating more accurate 3D geometry. Our method implemented with different representations has been validated on indoor datasets, achieving improved performance in both reconstruction and view synthesis.
comment: Accepted by ECCV2024
☆ A Simple Baseline with Single-encoder for Referring Image Segmentation
Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, commonly adapted dual-encoders in RIS, e.g., Swin transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with a pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that interact two encoders, but this approach leads to high computational costs. In this paper, we present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interactions of two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.
comment: ArXiv pre-print
☆ Depth-Weighted Detection of Behaviours of Risk in People with Dementia using Cameras
The behavioural and psychological symptoms of dementia, such as agitation and aggression, present a significant health and safety risk in residential care settings. Many care facilities have video cameras in place for digital monitoring of public spaces, which can be leveraged to develop an automated behaviours of risk detection system that can alert the staff to enable timely intervention and prevent the situation from escalating. However, one of the challenges in our previous study was the presence of false alarms due to obstruction of view by activities happening close to the camera. To address this issue, we proposed a novel depth-weighted loss function to train a customized convolutional autoencoder to enforce equivalent importance to the events happening both near and far from the cameras; thus, helping to reduce false alarms and making the method more suitable for real-world deployment. The proposed method was trained using data from nine participants with dementia across three cameras situated in a specialized dementia unit and achieved an area under the curve of receiver operating characteristic of $0.852$, $0.81$ and $0.768$ for the three cameras. Ablation analysis was conducted for the individual components of the proposed method and the performance of the proposed method was investigated for participant-specific and sex-specific behaviours of risk detection. The proposed method performed reasonably well in detecting behaviours of risk in people with dementia motivating further research toward the development of a behaviours of risk detection system suitable for deployment in video surveillance systems in care facilities.
☆ Continual-learning-based framework for structural damage recognition
Multi-damage is common in reinforced concrete structures and leads to the requirement of large number of neural networks, parameters and data storage, if convolutional neural network (CNN) is used for damage recognition. In addition, conventional CNN experiences catastrophic forgetting and training inefficiency as the number of tasks increases during continual learning, leading to large accuracy decrease of previous learned tasks. To address these problems, this study proposes a continuallearning-based damage recognition model (CLDRM) which integrates the learning without forgetting continual learning method into the ResNet-34 architecture for the recognition of damages in RC structures as well as relevant structural components. Three experiments for four recognition tasks were designed to validate the feasibility and effectiveness of the CLDRM framework. In this way, it reduces both the prediction time and data storage by about 75% in four tasks of continuous learning. Three experiments for four recognition tasks were designed to validate the feasibility and effectiveness of the CLDRM framework. By gradual feature fusion, CLDRM outperformed other methods by managed to achieve high accuracy in the damage recognition and classification. As the number of recognition tasks increased, CLDRM also experienced smaller decrease of the previous learned tasks. Results indicate that the CLDRM framework successfully performs damage recognition and classification with reasonable accuracy and effectiveness.
comment: 18 pages, 12 figures
☆ RoboSense: Large-scale Dataset and Benchmark for Multi-sensor Low-speed Autonomous Driving
Robust object detection and tracking under arbitrary sight of view is challenging yet essential for the development of Autonomous Vehicle technology. With the growing demand of unmanned function vehicles, near-field scene understanding becomes an important research topic in the areas of low-speed autonomous driving. Due to the complexity of driving conditions and diversity of near obstacles such as blind spots and high occlusion, the perception capability of near-field environment is still inferior than its farther counterpart. To further enhance the intelligent ability of unmanned vehicles, in this paper, we construct a multimodal data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable dynamic sight of view for ego vehicle, either global view or local view. Meanwhile, a large-scale multi-sensor dataset is built, named RoboSense, to facilitate near-field scene understanding. RoboSense contains more than 133K synchronized data with 1.4M 3D bounding box and IDs annotated in the full $360^{\circ}$ view, forming 216K trajectories across 7.6K temporal sequences. It has $270\times$ and $18\times$ as many annotations of near-field obstacles within 5$m$ as the previous single-vehicle datasets such as KITTI and nuScenes. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate the future development of related research, where the detailed data analysis as well as benchmarks are also provided accordingly.
☆ NAS-BNN: Neural Architecture Search for Binary Neural Networks
Binary Neural Networks (BNNs) have gained extensive attention for their superior inferencing efficiency and compression ratio compared to traditional full-precision networks. However, due to the unique characteristics of BNNs, designing a powerful binary architecture is challenging and often requires significant manpower. A promising solution is to utilize Neural Architecture Search (NAS) to assist in designing BNNs, but current NAS methods for BNNs are relatively straightforward and leave a performance gap between the searched models and manually designed ones. To address this gap, we propose a novel neural architecture search scheme for binary neural networks, named NAS-BNN. We first carefully design a search space based on the unique characteristics of BNNs. Then, we present three training strategies, which significantly enhance the training of supernet and boost the performance of all subnets. Our discovered binary model family outperforms previous BNNs for a wide range of operations (OPs) from 20M to 200M. For instance, we achieve 68.20% top-1 accuracy on ImageNet with only 57M OPs. In addition, we validate the transferability of these searched BNNs on the object detection task, and our binary detectors with the searched BNNs achieve a novel state-of-the-art result, e.g., 31.6% mAP with 370M OPs, on MS COCO dataset. The source code and models will be released at https://github.com/VDIGPKU/NAS-BNN.
comment: 23 pages
☆ Dynamic Reconstruction from Neuromorphic Data
Unlike traditional cameras which synchronously register pixel intensity, neuromorphic sensors only register `changes' at pixels where a change is occurring asynchronously. This enables neuromorphic sensors to sample at a micro-second level and efficiently capture the dynamics. Since, only sequences of asynchronous event changes are recorded rather than brightness intensities over time, many traditional image processing techniques cannot be directly applied. Furthermore, existing approaches, including the ones recently introduced by the authors, use traditional images combined with neuromorphic event data to carry out reconstructions. The aim of this work is introduce an optimization based approach to reconstruct images and dynamics only from the neuromoprhic event data without any additional knowledge of the events. Each pixel is modeled temporally. The experimental results on real data highlight the efficacy of the presented approach, paving the way for efficient and accurate processing of neuromorphic sensor data in real-world applications.
☆ Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.
comment: Project page https://haozhuo-zhang.github.io/Hand1000-project-page/
☆ Avoiding Generative Model Writer's Block With Embedding Nudging
Generative image models, since introduction, have become a global phenomenon. From new arts becoming possible to new vectors of abuse, many new capabilities have become available. One of the challenging issues with generative models is controlling the generation process specially to prevent specific generations classes or instances . There are several reasons why one may want to control the output of generative models, ranging from privacy and safety concerns to application limitations or user preferences To address memorization and privacy challenges, there has been considerable research dedicated to filtering prompts or filtering the outputs of these models. What all these solutions have in common is that at the end of the day they stop the model from producing anything, hence limiting the usability of the model. In this paper, we propose a method for addressing this usability issue by making it possible to steer away from unwanted concepts (when detected in model's output) and still generating outputs. In particular we focus on the latent diffusion image generative models and how one can prevent them to generate particular images while generating similar images with limited overhead. We focus on mitigating issues like image memorization, demonstrating our technique's effectiveness through qualitative and quantitative evaluations. Our method successfully prevents the generation of memorized training images while maintaining comparable image quality and relevance to the unmodified model.
☆ VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of 12 state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images. The code and datasets for running all the analyses reported in this paper can be found at https://github.com/sammarfy/VLM4Bio.
comment: 36 pages, 37 figures, 7 tables
☆ Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? ECCV 2024
Foundation models have emerged as robust models with label efficiency in diverse domains. In medical imaging, these models contribute to the advancement of medical diagnoses due to the difficulty in obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines the bias in the Foundation model (RetFound) when it is applied to fine-tune the Brazilian Multilabel Ophthalmological Dataset (BRSET), which has a different population than the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the Foundation Model has the potential to reduce the gap between the maximum AUC and minimum AUC evaluations across gender and age groups. However, in a data-efficient generalization, the model increases the bias when the data amount decreases. These findings suggest that when deploying a Foundation Model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
comment: Preprint of paper to be presented at Fairness and Ethics Towards Transparent AI: Facing the Challenge through Model Debiasing (FAILED) during ECCV 2024
☆ Single-Photon 3D Imaging with Equi-Depth Photon Histograms
Single-photon cameras present a promising avenue for high-resolution 3D imaging. They have ultra-high sensitivity -- down to individual photons -- and can record photon arrival times with extremely high (sub-nanosecond) resolution. Single-photon 3D cameras estimate the round-trip time of a laser pulse by forming equi-width (EW) histograms of detected photon timestamps. Acquiring and transferring such EW histograms requires high bandwidth and in-pixel memory, making SPCs less attractive in resource-constrained settings such as mobile devices and AR/VR headsets. In this work we propose a 3D sensing technique based on equi-depth (ED) histograms. ED histograms compress timestamp data more efficiently than EW histograms, reducing the bandwidth requirement. Moreover, to reduce the in-pixel memory requirement, we propose a lightweight algorithm to estimate ED histograms in an online fashion without explicitly storing the photon timestamps. This algorithm is amenable to future in-pixel implementations. We propose algorithms that process ED histograms to perform 3D computer-vision tasks of estimating scene distance maps and performing visual odometry under challenging conditions such as high ambient light. Our work paves the way towards lower bandwidth and reduced in-pixel memory requirements for SPCs, making them attractive for resource-constrained 3D vision applications. Project page: $\href{https://www.computational.camera/pedh}{https://www.computational.camera/pedh}$
☆ Using Backbone Foundation Model for Evaluating Fairness in Chest Radiography Without Demographic Data MICCAI 2024
Ensuring consistent performance across diverse populations and incorporating fairness into machine learning models are crucial for advancing medical image diagnostics and promoting equitable healthcare. However, many databases do not provide protected attributes or contain unbalanced representations of demographic groups, complicating the evaluation of model performance across different demographics and the application of bias mitigation techniques that rely on these attributes. This study aims to investigate the effectiveness of using the backbone of Foundation Models as an embedding extractor for creating groups that represent protected attributes, such as gender and age. We propose utilizing these groups in different stages of bias mitigation, including pre-processing, in-processing, and evaluation. Using databases in and out-of-distribution scenarios, it is possible to identify that the method can create groups that represent gender in both databases and reduce in 4.44% the difference between the gender attribute in-distribution and 6.16% in out-of-distribution. However, the model lacks robustness in handling age attributes, underscoring the need for more fundamentally fair and robust Foundation models. These findings suggest a role in promoting fairness assessment in scenarios where we lack knowledge of attributes, contributing to the development of more equitable medical diagnostics.
comment: Preprint of paper to be presented at Fairness of AI in Medical Imaging (FAIMI) during MICCAI 2024
☆ ChartEye: A Deep Learning Framework for Chart Information Extraction
The widespread use of charts and infographics as a means of data visualization in various domains has inspired recent research in automated chart understanding. However, information extraction from chart images is a complex multitasked process due to style variations and, as a consequence, it is challenging to design an end-to-end system. In this study, we propose a deep learning-based framework that provides a solution for key steps in the chart information extraction pipeline. The proposed framework utilizes hierarchal vision transformers for the tasks of chart-type and text-role classification, while YOLOv7 for text detection. The detected text is then enhanced using Super Resolution Generative Adversarial Networks to improve the recognition output of the OCR. Experimental results on a benchmark dataset show that our proposed framework achieves excellent performance at every stage with F1-scores of 0.97 for chart-type classification, 0.91 for text-role classification, and a mean Average Precision of 0.95 for text detection.
comment: 8 Pages, and 11 Figures
☆ Alternating Direction Method of Multipliers for Negative Binomial Model with The Weighted Difference of Anisotropic and Isotropic Total Variation ICME
In many applications such as medical imaging, the measurement data represent counts of photons hitting a detector. Such counts in low-photon settings are often modeled using a Poisson distribution. However, this model assumes that the mean and variance of the signal's noise distribution are equal. For overdispersed data where the variance is greater than the mean, the negative binomial distribution is a more appropriate statistical model. In this paper, we propose an optimization approach for recovering images corrupted by overdispersed Poisson noise. In particular, we incorporate a weighted anisotropic-isotropic total variation regularizer, which avoids staircasing artifacts that are introduced by a regular total variation penalty. We use an alternating direction method of multipliers, where each subproblem has a closed-form solution. Numerical experiments demonstrate the effectiveness of our proposed approach, especially in very photon-limited settings.
comment: 6 pages, Accepted by the IEEE International Conference on Multimedia and Expo (ICME)
☆ Negative Binomial Matrix Completion SP
Matrix completion focuses on recovering missing or incomplete information in matrices. This problem arises in various applications, including image processing and network analysis. Previous research proposed Poisson matrix completion for count data with noise that follows a Poisson distribution, which assumes that the mean and variance are equal. Since overdispersed count data, whose variance is greater than the mean, is more likely to occur in realistic settings, we assume that the noise follows the negative binomial (NB) distribution, which can be more general than the Poisson distribution. In this paper, we introduce NB matrix completion by proposing a nuclear-norm regularized model that can be solved by proximal gradient descent. In our experiments, we demonstrate that the NB model outperforms Poisson matrix completion in various noise and missing data settings on real data.
comment: 6 pages, Accepted by the IEEE International Workshop on Machine Learning for Signal Processing (MLSP)
☆ 3D Reconstruction with Spatial Memory
We present Spann3R, a novel approach for dense 3D reconstruction from ordered or unordered image collections. Built on the DUSt3R paradigm, Spann3R uses a transformer-based architecture to directly regress pointmaps from images without any prior knowledge of the scene or camera parameters. Unlike DUSt3R, which predicts per image-pair pointmaps each expressed in its local coordinate frame, Spann3R can predict per-image pointmaps expressed in a global coordinate system, thus eliminating the need for optimization-based global alignment. The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system. Taking advantage of DUSt3R's pre-trained weights, and further fine-tuning on a subset of datasets, Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time. Project page: \url{https://hengyiwang.github.io/projects/spanner}
comment: Project page: \url{https://hengyiwang.github.io/projects/spanner}
♻ ☆ HER2 and FISH Status Prediction in Breast Biopsy H&E-Stained Images Using Deep Learning
The current standard for detecting human epidermal growth factor receptor 2 (HER2) status in breast cancer patients relies on HER2 amplification, identified through fluorescence in situ hybridization (FISH) or immunohistochemistry (IHC). However, hematoxylin and eosin (H\&E) tumor stains are more widely available, and accurately predicting HER2 status using H\&E could reduce costs and expedite treatment selection. Deep Learning algorithms for H&E have shown effectiveness in predicting various cancer features and clinical outcomes, including moderate success in HER2 status prediction. In this work, we employed a customized weak supervision classification technique combined with MoCo-v2 contrastive learning to predict HER2 status. We trained our pipeline on 182 publicly available H&E Whole Slide Images (WSIs) from The Cancer Genome Atlas (TCGA), for which annotations by the pathology team at Yale School of Medicine are publicly available. Our pipeline achieved an Area Under the Curve (AUC) of 0.85 across four different test folds. Additionally, we tested our model on 44 H&E slides from the TCGA-BRCA dataset, which had an HER2 score of 2+ and included corresponding HER2 status and FISH test results. These cases are considered equivocal for IHC, requiring an expensive FISH test on their IHC slides for disambiguation. Our pipeline demonstrated an AUC of 0.81 on these challenging H&E slides. Reducing the need for FISH test can have significant implications in cancer treatment equity for underserved populations.
♻ ☆ SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition IROS2024
We present a new learning approach, Soft Conditional Prompt Learning (SCP), which leverages the strengths of prompt learning for aerial video action recognition. Our approach is designed to predict the action of each agent by helping the models focus on the descriptions or instructions associated with actions in the input videos for aerial/robot visual perception. Our formulation supports various prompts, including learnable prompts, auxiliary visual information, and large vision models to improve the recognition performance. We present a soft conditional prompt method that learns to dynamically generate prompts from a pool of prompt experts under different video inputs. By sharing the same objective with the task, our proposed SCP can optimize prompts that guide the model's predictions while explicitly learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial video datasets (Okutama, NECDrone), which consist of scenes with single-agent and multi-agent actions. We further evaluate our approach on ground camera videos to verify the effectiveness and generalization and achieve a 1.0-3.6% improvement on dataset SSV2. We integrate our method into the ROS2 as well.
comment: IROS2024
♻ ☆ Examining Pathological Bias in a Generative Adversarial Network Discriminator: A Case Study on a StyleGAN3 Model
Generative adversarial networks (GANs) generate photorealistic faces that are often indistinguishable by humans from real faces. While biases in machine learning models are often assumed to be due to biases in training data, we find pathological internal color and luminance biases in the discriminator of a pre-trained StyleGAN3-r model that are not explicable by the training data. We also find that the discriminator systematically stratifies scores by both image- and face-level qualities and that this disproportionately affects images across gender, race, and other categories. We examine axes common in research on stereotyping in social psychology.
♻ ☆ Infusion: internal diffusion for inpainting of dynamic textures and complex motion
Video inpainting is the task of filling a region in a video in a visually convincing manner. It is very challenging due to the high dimensionality of the data and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Such models remain nonetheless very expensive to train and to perform inference with, which strongly reduce their applicability to videos, and yields unreasonable computational loads. We show that in the case of video inpainting, thanks to the highly auto-similar nature of videos, the training data of a diffusion model can be restricted to the input video and still produce very satisfying results. This leads us to adopt an internal learning approach, which also allows us to greatly reduce the neural network size by about three orders of magnitude less than current diffusion models used for image inpainting. We also introduce a new method for efficient training and inference of diffusion models in the context of internal learning, by splitting the diffusion process into different learning intervals corresponding to different noise levels of the diffusion process. To the best of our knowledge, this is the first video inpainting method based purely on diffusion. Other methods require additional components such as optical flow estimation, which limits their performance in the case of dynamic textures and complex motions. We show qualitative and quantitative results, demonstrating that our method reaches state of the art performance in the case of dynamic textures and complex dynamic backgrounds.
comment: 11 pages, 10 figures
♻ ☆ Provable Probabilistic Imaging using Score-Based Generative Priors
Estimating high-quality images while also quantifying their uncertainty are two desired features in an image reconstruction algorithm for solving ill-posed inverse problems. In this paper, we propose plug-and-play Monte Carlo (PMC) as a principled framework for characterizing the space of possible solutions to a general inverse problem. PMC is able to incorporate expressive score-based generative priors for high-quality image reconstruction while also performing uncertainty quantification via posterior sampling. In particular, we develop two PMC algorithms that can be viewed as the sampling analogues of the traditional plug-and-play priors (PnP) and regularization by denoising (RED) algorithms. To improve the sampling efficiency, we introduce weighted annealing into these PMC algorithms, further developing two additional annealed PMC algorithms (APMC). We establish a theoretical analysis for characterizing the convergence behavior of PMC algorithms. Our analysis provides non-asymptotic stationarity guarantees in terms of the Fisher information, fully compatible with the joint presence of weighted annealing, potentially non-log-concave likelihoods, and imperfect score networks. We demonstrate the performance of the PMC algorithms on multiple representative inverse problems with both linear and nonlinear forward models. Experimental results show that PMC significantly improves reconstruction quality and enables high-fidelity uncertainty quantification.
♻ ☆ Imperceptible Protection against Style Imitation from Diffusion Models
Recent progress in diffusion models has profoundly enhanced the fidelity of image generation, but it has raised concerns about copyright infringements. While prior methods have introduced adversarial perturbations to prevent style imitation, most are accompanied by the degradation of artworks' visual quality. Recognizing the importance of maintaining this, we introduce a visually improved protection method while preserving its protection capability. To this end, we devise a perceptual map to highlight areas sensitive to human eyes, guided by instance-aware refinement, which refines the protection intensity accordingly. We also introduce a difficulty-aware protection by predicting how difficult the artwork is to protect and dynamically adjusting the intensity based on this. Lastly, we integrate a perceptual constraints bank to further improve the imperceptibility. Results show that our method substantially elevates the quality of the protected image without compromising on protection efficacy.
♻ ☆ u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
Recent advancements in multi-modal large language models (MLLMs) have led to substantial improvements in visual understanding, primarily driven by sophisticated modality alignment strategies. However, predominant approaches prioritize global or regional comprehension, with less focus on fine-grained, pixel-level tasks. To address this gap, we introduce u-LLaVA, an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs. We commence by leveraging an efficient modality alignment approach, harnessing both image and video datasets to bolster the model's foundational understanding across diverse visual contexts. Subsequently, a joint instruction tuning method with task-specific projectors and decoders for end-to-end downstream training is presented. Furthermore, this work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs. The overall framework is simple, effective, and achieves state-of-the-art performance across multiple benchmarks. We also make our model, data, and code publicly accessible at https://github.com/OPPOMKLab/u-LLaVA.
♻ ☆ Automated Real-World Sustainability Data Generation from Images of Buildings
When data on building features is unavailable, the task of determining how to improve that building in terms of carbon emissions becomes infeasible. We show that from only a set of images, a Large Language Model with appropriate prompt engineering and domain knowledge can successfully estimate a range of building features relevant for sustainability calculations. We compare our novel image-to-data method with a ground truth comprising real building data for 47 apartments and achieve accuracy better than a human performing the same task. We also demonstrate that the method can generate tailored recommendations to the owner on how best to improve their properties and discuss methods to scale the approach.
comment: 6 pages
♻ ☆ When Multi-Task Learning Meets Partial Supervision: A Computer Vision Review
Multi-Task Learning (MTL) aims to learn multiple tasks simultaneously while exploiting their mutual relationships. By using shared resources to simultaneously calculate multiple outputs, this learning paradigm has the potential to have lower memory requirements and inference times compared to the traditional approach of using separate methods for each task. Previous work in MTL has mainly focused on fully-supervised methods, as task relationships can not only be leveraged to lower the level of data-dependency of those methods but they can also improve performance. However, MTL introduces a set of challenges due to a complex optimisation scheme and a higher labeling requirement. This review focuses on how MTL could be utilised under different partial supervision settings to address these challenges. First, this review analyses how MTL traditionally uses different parameter sharing techniques to transfer knowledge in between tasks. Second, it presents the different challenges arising from such a multi-objective optimisation scheme. Third, it introduces how task groupings can be achieved by analysing task relationships. Fourth, it focuses on how partially supervised methods applied to MTL can tackle the aforementioned challenges. Lastly, this review presents the available datasets, tools and benchmarking results of such methods.
comment: Accepted by Proceedings of the IEEE
♻ ☆ Research on the Spatial Data Intelligent Foundation Model
This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial data intelligent large models and their applications in urban environments, aerospace remote sensing, geography, transportation, and other scenarios. Additionally, it summarizes the latest application cases of spatial data intelligent large models in themes such as urban development, multimodal systems, remote sensing, smart transportation, and resource environments. Finally, the report concludes with an overview and outlook on the development prospects of spatial data intelligent large models.
comment: V1 and V2 are in Chinese language, other versions are in English
♻ ☆ FRAME: A Modular Framework for Autonomous Map Merging: Advancements in the Field
In this article, a novel approach for merging 3D point cloud maps in the context of egocentric multi-robot exploration is presented. Unlike traditional methods, the proposed approach leverages state-of-the-art place recognition and learned descriptors to efficiently detect overlap between maps, eliminating the need for the time-consuming global feature extraction and feature matching process. The estimated overlapping regions are used to calculate a homogeneous rigid transform, which serves as an initial condition for the GICP point cloud registration algorithm to refine the alignment between the maps. The advantages of this approach include faster processing time, improved accuracy, and increased robustness in challenging environments. Furthermore, the effectiveness of the proposed framework is successfully demonstrated through multiple field missions of robot exploration in a variety of different underground environments.
comment: 28 pages, 24 figures. Accepted to the IEEE Transactions on Field Robotics
♻ ☆ Re-Nerfing: Improving Novel View Synthesis through Novel View Synthesis
Recent neural rendering and reconstruction techniques, such as NeRFs or Gaussian Splatting, have shown remarkable novel view synthesis capabilities but require hundreds of images of the scene from diverse viewpoints to render high-quality novel views. With fewer images available, these methods start to fail since they can no longer correctly triangulate the underlying 3D geometry and converge to a non-optimal solution. These failures can manifest as floaters or blurry renderings in sparsely observed areas of the scene. In this paper, we propose Re-Nerfing, a simple and general add-on approach that leverages novel view synthesis itself to tackle this problem. Using an already trained NVS method, we render novel views between existing ones and augment the training data to optimize a second model. This introduces additional multi-view constraints and allows the second model to converge to a better solution. With Re-Nerfing we achieve significant improvements upon multiple pipelines based on NeRF and Gaussian-Splatting in sparse view settings of the mip-NeRF 360 and LLFF datasets. Notably, Re-Nerfing does not require prior knowledge or extra supervision signals, making it a flexible and practical add-on.
comment: Code will be released upon acceptance
♻ ☆ FAST-LIVO2: Fast, Direct LiDAR-Inertial-Visual Odometry
This paper proposes FAST-LIVO2: a fast, direct LiDAR-inertial-visual odometry framework to achieve accurate and robust state estimation in SLAM tasks and provide great potential in real-time, onboard robotic applications. FAST-LIVO2 fuses the IMU, LiDAR and image measurements efficiently through an ESIKF. To address the dimension mismatch between the heterogeneous LiDAR and image measurements, we use a sequential update strategy in the Kalman filter. To enhance the efficiency, we use direct methods for both the visual and LiDAR fusion, where the LiDAR module registers raw points without extracting edge or plane features and the visual module minimizes direct photometric errors without extracting ORB or FAST corner features. The fusion of both visual and LiDAR measurements is based on a single unified voxel map where the LiDAR module constructs the geometric structure for registering new LiDAR scans and the visual module attaches image patches to the LiDAR points. To enhance the accuracy of image alignment, we use plane priors from the LiDAR points in the voxel map (and even refine the plane prior) and update the reference patch dynamically after new images are aligned. Furthermore, to enhance the robustness of image alignment, FAST-LIVO2 employs an on-demanding raycast operation and estimates the image exposure time in real time. Lastly, we detail three applications of FAST-LIVO2: UAV onboard navigation demonstrating the system's computation efficiency for real-time onboard navigation, airborne mapping showcasing the system's mapping accuracy, and 3D model rendering (mesh-based and NeRF-based) underscoring the suitability of our reconstructed dense map for subsequent rendering tasks. We open source our code, dataset and application on GitHub to benefit the robotics community.
comment: 30 pages, 31 figures, due to the limitation that 'The abstract field cannot exceed 1,920 characters', the abstract presented here is shorter than the one in the PDF file
♻ ☆ Automated Label Unification for Multi-Dataset Semantic Segmentation with GNNs
Deep supervised models possess significant capability to assimilate extensive training data, thereby presenting an opportunity to enhance model performance through training on multiple datasets. However, conflicts arising from different label spaces among datasets may adversely affect model performance. In this paper, we propose a novel approach to automatically construct a unified label space across multiple datasets using graph neural networks. This enables semantic segmentation models to be trained simultaneously on multiple datasets, resulting in performance improvements. Unlike existing methods, our approach facilitates seamless training without the need for additional manual reannotation or taxonomy reconciliation. This significantly enhances the efficiency and effectiveness of multi-dataset segmentation model training. The results demonstrate that our method significantly outperforms other multi-dataset training methods when trained on seven datasets simultaneously, and achieves state-of-the-art performance on the WildDash 2 benchmark.
♻ ☆ AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results
Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dataset of 459 videos, encoded with 14 codecs of various compression standards (AVC/H.264, HEVC/H.265, AV1, and VVC/H.266) and containing a comprehensive collection of compression artifacts. To measure the methods performance, we employed traditional correlation coefficients between their predictions and subjective scores, which were collected via large-scale crowdsourced pairwise human comparisons. For training purposes, participants were provided with the Compressed Video Quality Assessment Dataset (CVQAD), a previously developed dataset of 1022 videos. Up to 30 participating teams registered for the challenge, while we report the results of 6 teams, which submitted valid final solutions and code for reproducing the results. Moreover, we calculated and present the performance of state-of-the-art VQA methods on the developed dataset, providing a comprehensive benchmark for future research. The dataset, results, and online leaderboard are publicly available at https://challenges.videoprocessing.ai/challenges/compressedvideo-quality-assessment.html.
♻ ☆ DeepMIF: Deep Monotonic Implicit Fields for Large-Scale LiDAR 3D Mapping
Recently, significant progress has been achieved in sensing real large-scale outdoor 3D environments, particularly by using modern acquisition equipment such as LiDAR sensors. Unfortunately, they are fundamentally limited in their ability to produce dense, complete 3D scenes. To address this issue, recent learning-based methods integrate neural implicit representations and optimizable feature grids to approximate surfaces of 3D scenes. However, naively fitting samples along raw LiDAR rays leads to noisy 3D mapping results due to the nature of sparse, conflicting LiDAR measurements. Instead, in this work we depart from fitting LiDAR data exactly, instead letting the network optimize a non-metric monotonic implicit field defined in 3D space. To fit our field, we design a learning system integrating a monotonicity loss that enables optimizing neural monotonic fields and leverages recent progress in large-scale 3D mapping. Our algorithm achieves high-quality dense 3D mapping performance as captured by multiple quantitative and perceptual measures and visual results obtained for Mai City, Newer College, and KITTI benchmarks. The code of our approach will be made publicly available.
comment: 8 pages, 6 figures
♻ ☆ FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction
Researchers have proposed to use data of human preference feedback to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. Therefore, we develop and test a method to automatically score user preferences from their spontaneous facial expression reaction to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. We develop an FAU-Net (Facial Action Units Neural Network), which receives inputs from an AU estimation model, to automatically score user preferences for text-to-image generation based on their facial expression reactions, which is complementary to the pre-trained scoring models based on the input text prompts and generated images. Integrating our FAU-Net valence score with the pre-trained scoring models improves their consistency with human preferences. This method of automatic annotation with facial expression analysis can be potentially generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is also available at the same link for research purposes.
♻ ☆ CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring ECCV2024
Video deblurring aims to enhance the quality of restored results in motion-blurred videos by effectively gathering information from adjacent video frames to compensate for the insufficient data in a single blurred frame. However, when faced with consecutively severe motion blur situations, frame-based video deblurring methods often fail to find accurate temporal correspondence among neighboring video frames, leading to diminished performance. To address this limitation, we aim to solve the video deblurring task by leveraging an event camera with micro-second temporal resolution. To fully exploit the dense temporal resolution of the event camera, we propose two modules: 1) Intra-frame feature enhancement operates within the exposure time of a single blurred frame, iteratively enhancing cross-modality features in a recurrent manner to better utilize the rich temporal information of events, 2) Inter-frame temporal feature alignment gathers valuable long-range temporal information to target frames, aggregating sharp features leveraging the advantages of the events. In addition, we present a novel dataset composed of real-world blurred RGB videos, corresponding sharp videos, and event data. This dataset serves as a valuable resource for evaluating event-guided deblurring methods. We demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods through extensive experiments conducted on both synthetic and real-world deblurring datasets. The code and dataset are available at https://github.com/intelpro/CMTA.
comment: Accepted in ECCV2024
♻ ☆ Training-Free Action Recognition and Goal Inference with Dynamic Frame Selection
We introduce VidTFS, a Training-free, open-vocabulary video goal and action inference framework that combines the frozen vision foundational model (VFM) and large language model (LLM) with a novel dynamic Frame Selection module. Our experiments demonstrate that the proposed frame selection module improves the performance of the framework significantly. We validate the performance of the proposed VidTFS on four widely used video datasets, including CrossTask, COIN, UCF101, and ActivityNet, covering goal inference and action recognition tasks under open-vocabulary settings without requiring any training or fine-tuning. The results show that VidTFS outperforms pretrained and instruction-tuned multimodal language models that directly stack LLM and VFM for downstream video inference tasks. Our VidTFS with its adaptability shows the future potential for generalizing to new training-free video inference tasks.
♻ ☆ Unrecognizable Yet Identifiable: Image Distortion with Preserved Embeddings
Biometric authentication systems play a crucial role in modern security systems. However, maintaining the balance of privacy and integrity of stored biometrics derivative data while achieving high recognition accuracy is often challenging. Addressing this issue, we introduce an innovative image transformation technique that effectively renders facial images unrecognizable to the eye while maintaining their identifiability by neural network models, which allows the distorted photo version to be stored for further verification. While initially intended for biometrics systems, the proposed methodology can be used in various artificial intelligence applications to distort the visual data and keep the derived features close. By experimenting with widely used datasets LFW and MNIST, we show that it is possible to build the distortion that changes the image content by more than 70% while maintaining the same recognition accuracy. We compare our method with previously state-of-the-art approaches. We publically release the source code.
♻ ☆ How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing IROS 2024
As model and dataset sizes continue to scale in robot learning, the need to understand how the composition and properties of a dataset affect model performance becomes increasingly urgent to ensure cost-effective data collection and model performance. In this work, we empirically investigate how physics attributes (color, friction coefficient, shape) and scene background characteristics, such as the complexity and dynamics of interactions with background objects, influence the performance of Video Transformers in predicting planar pushing trajectories. We investigate three primary questions: How do physics attributes and background scene characteristics influence model performance? What kind of changes in attributes are most detrimental to model generalization? What proportion of fine-tuning data is required to adapt models to novel scenarios? To facilitate this research, we present CloudGripper-Push-1K, a large real-world vision-based robot pushing dataset comprising 1278 hours and 460,000 videos of planar pushing interactions with objects with different physics and background attributes. We also propose Video Occlusion Transformer (VOT), a generic modular video-transformer-based trajectory prediction framework which features 3 choices of 2D-spatial encoders as the subject of our case study. The dataset and source code are available at https://cloudgripper.org.
comment: IEEE/RSJ IROS 2024
♻ ☆ Evidential Deep Partial Multi-View Classification With Discount Fusion
Incomplete multi-view data classification poses significant challenges due to the common issue of missing views in real-world scenarios. Despite advancements, existing methods often fail to provide reliable predictions, largely due to the uncertainty of missing views and the inconsistent quality of imputed data. To tackle these problems, we propose a novel framework called Evidential Deep Partial Multi-View Classification (EDP-MVC). Initially, we use K-means imputation to address missing views, creating a complete set of multi-view data. However, the potential conflicts and uncertainties within this imputed data can affect the reliability of downstream inferences. To manage this, we introduce a Conflict-Aware Evidential Fusion Network (CAEFN), which dynamically adjusts based on the reliability of the evidence, ensuring trustworthy discount fusion and producing reliable inference outcomes. Comprehensive experiments on various benchmark datasets reveal EDP-MVC not only matches but often surpasses the performance of state-of-the-art methods.
comment: Ongoing work. 13 pages, 3 figures, 6 tables
♻ ☆ Urdu Digital Text Word Optical Character Recognition Using Permuted Auto Regressive Sequence Modeling
This research paper presents a novel word-level Optical Character Recognition (OCR) model developed specifically for digital Urdu text. The model utilizes transformer-based architectures and attention mechanisms to address the unique challenges of recognizing Urdu script, which includes handling a diverse range of text styles, fonts, and variations. Trained on a comprehensive dataset of approximately 160,000 Urdu text images, the model incorporates a permuted autoregressive sequence (PARSeq) architecture. This design enables context-aware inference and iterative refinement by leveraging bidirectional context information, significantly enhancing its ability to accurately recognize Urdu characters. The model achieves a character error rate (CER) of 0.178, highlighting its effectiveness and precision in real-world applications. However, the model has some limitations, such as difficulties with blurred images, non-horizontal orientations, and the presence of trailing punctuation marks, which can introduce noise into the recognition process. Addressing these challenges will be a key focus of future work. Future research will aim to further refine the model through advanced data augmentation techniques, optimization of hyperparameters, and the integration of context-aware language models, ultimately enhancing the model's performance and robustness in Urdu text recognition.
♻ ☆ When ControlNet Meets Inexplicit Masks: A Case Study of ControlNet on its Contour-following Ability
ControlNet excels at creating content that closely matches precise contours in user-provided masks. However, when these masks contain noise, as a frequent occurrence with non-expert users, the output would include unwanted artifacts. This paper first highlights the crucial role of controlling the impact of these inexplicit masks with diverse deterioration levels through in-depth analysis. Subsequently, to enhance controllability with inexplicit masks, an advanced Shape-aware ControlNet consisting of a deterioration estimator and a shape-prior modulation block is devised. The deterioration estimator assesses the deterioration factor of the provided masks. Then this factor is utilized in the modulation block to adaptively modulate the model's contour-following ability, which helps it dismiss the noise part in the inexplicit masks. Extensive experiments prove its effectiveness in encouraging ControlNet to interpret inaccurate spatial conditions robustly rather than blindly following the given contours. We showcase application scenarios like modifying shape priors and composable shape-controllable generation. Codes are soon available.
comment: Accepted by ACM-MM 2024
♻ ☆ Deep Learning for Computer Vision based Activity Recognition and Fall Detection of the Elderly: a Systematic Review
As the percentage of elderly people in developed countries increases worldwide, the healthcare of this collective is a worrying matter, especially if it includes the preservation of their autonomy. In this direction, many studies are being published on Ambient Assisted Living (AAL) systems, which help to reduce the preoccupations raised by the independent living of the elderly. In this study, a systematic review of the literature is presented on fall detection and Human Activity Recognition (HAR) for the elderly, as the two main tasks to solve to guarantee the safety of elderly people living alone. To address the current tendency to perform these two tasks, the review focuses on the use of Deep Learning (DL) based approaches on computer vision data. In addition, different collections of data like DL models, datasets or hardware (e.g. depth or thermal cameras) are gathered from the reviewed studies and provided for reference in future studies. Strengths and weaknesses of existing approaches are also discussed and, based on them, our recommendations for future works are provided.
♻ ☆ LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description
Visual Spatial Description (VSD) aims to generate texts that describe the spatial relationships between objects within images. Traditional visual spatial relationship classification (VSRC) methods typically output the spatial relationship between two objects in an image, often neglecting world knowledge and lacking general language capabilities. In this paper, we propose a Large Language-and-Vision Assistant for Visual Spatial Description, named LLaVA-VSD, which is designed for the classification, description, and open-ended description of visual spatial relationships. Specifically, the model first constructs a VSD instruction-following dataset using given figure-caption pairs for the three tasks. It then employs LoRA to fine-tune a Large Language and Vision Assistant for VSD, which has 13 billion parameters and supports high-resolution images. Finally, a large language model (Qwen-2) is used to refine the generated sentences, enhancing their diversity and accuracy. LLaVA-VSD demonstrates excellent multimodal conversational capabilities and can follow open-ended instructions to assist with inquiries about object relationships in images.
comment: We have discovered a significant error in the paper that affects the main conclusions. To ensure the accuracy of our research, we have decided to withdraw this paper and will resubmit it after making the necessary corrections
♻ ☆ Solid Waste Detection, Monitoring and Mapping in Remote Sensing Images: A Survey
The detection and characterization of illegal solid waste disposal sites are essential for environmental protection, particularly for mitigating pollution and health hazards. Improperly managed landfills contaminate soil and groundwater via rainwater infiltration, posing threats to both animals and humans. Traditional landfill identification approaches, such as on-site inspections, are time-consuming and expensive. Remote sensing is a cost-effective solution for the identification and monitoring of solid waste disposal sites that enables broad coverage and repeated acquisitions over time. Earth Observation (EO) satellites, equipped with an array of sensors and imaging capabilities, have been providing high-resolution data for several decades. Researchers proposed specialized techniques that leverage remote sensing imagery to perform a range of tasks such as waste site detection, dumping site monitoring, and assessment of suitable locations for new landfills. This review aims to provide a detailed illustration of the most relevant proposals for the detection and monitoring of solid waste sites by describing and comparing the approaches, the implemented techniques, and the employed data. Furthermore, since the data sources are of the utmost importance for developing an effective solid waste detection model, a comprehensive overview of the satellites and publicly available data sets is presented. Finally, this paper identifies the open issues in the state-of-the-art and discusses the relevant research directions for reducing the costs and improving the effectiveness of novel solid waste detection methods.
♻ ☆ TokenPacker: Efficient Visual Projector for Multimodal LLM
The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significantly. Some recent works have introduced resampler or abstractor to reduce the number of resulting visual tokens. Unfortunately, they fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector, which adopts a coarse-to-fine scheme to inject the enriched characteristics to generate the condensed visual tokens. In specific, we first interpolate the visual features as a low-resolution point query, providing the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that utilizes high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for the subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75%~89%, while achieves comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source codes can be found at https://github.com/CircleRadon/TokenPacker.
comment: 16 pages, Codes:https://github.com/CircleRadon/TokenPacker
♻ ☆ Unveiling the Human-like Similarities of Automatic Facial Expression Recognition: An Empirical Exploration through Explainable AI
Facial expression recognition is vital for human behavior analysis, and deep learning has enabled models that can outperform humans. However, it is unclear how closely they mimic human processing. This study aims to explore the similarity between deep neural networks and human perception by comparing twelve different networks, including both general object classifiers and FER-specific models. We employ an innovative global explainable AI method to generate heatmaps, revealing crucial facial regions for the twelve networks trained on six facial expressions. We assess these results both quantitatively and qualitatively, comparing them to ground truth masks based on Friesen and Ekman's description and among them. We use Intersection over Union (IoU) and normalized correlation coefficients for comparisons. We generate 72 heatmaps to highlight critical regions for each expression and architecture. Qualitatively, models with pre-trained weights show more similarity in heatmaps compared to those without pre-training. Specifically, eye and nose areas influence certain facial expressions, while the mouth is consistently important across all models and expressions. Quantitatively, we find low average IoU values (avg. 0.2702) across all expressions and architectures. The best-performing architecture averages 0.3269, while the worst-performing one averages 0.2066. Dendrograms, built with the normalized correlation coefficient, reveal two main clusters for most expressions: models with pre-training and models without pre-training. Findings suggest limited alignment between human and AI facial expression recognition, with network architectures influencing the similarity, as similar architectures prioritize similar facial regions.
comment: Multimed Tools Appl (2024)
♻ ☆ DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding
Text-rich document understanding (TDU) refers to analyzing and comprehending documents containing substantial textual content. With the rapid evolution of large language models (LLMs), they have been widely leveraged for TDU due to their remarkable versatility and generalization. In this paper, we introduce DocLayLLM, an efficient and effective multi-modal extension of LLMs specifically designed for TDU. By integrating visual patch tokens and 2D positional tokens into LLMs and encoding the document content using the LLMs themselves, we fully take advantage of the document comprehension capability of LLMs and enhance their perception of OCR information. We have also deeply considered the role of the chain-of-thought (CoT) and innovatively proposed the techniques of CoT Pre-training and CoT Annealing. Our DocLayLLM can achieve remarkable performances with lightweight training settings, showcasing its efficiency and effectiveness. Experimental results demonstrate that our DocLayLLM surpasses existing OCR-dependent methods and also outperforms OCR-free competitors.
♻ ☆ Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case-study, which introduces variability in (static) group formation via subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.
comment: 11 pages, 9 figures
♻ ☆ Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance
Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness through training models from scratch, the limited multi-modal data hinders these methods from reaching optimality. In this paper, we propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Despite serving as a recent vision fundamental model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop \underline{SAM} with se\underline{m}antic f\underline{e}ature fu\underline{s}ion guidanc\underline{e} (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt SAM to multi-modal SOD tasks. However, it is difficult for SAM trained on single-modal data to directly mine the complementary benefits of multi-modal inputs and comprehensively utilize them to achieve accurate saliency prediction.To address these issues, we first design a multi-modal complementary fusion module to extract robust multi-modal semantic features by integrating information from visible and thermal or depth image pairs. Then, we feed the extracted multi-modal semantic features into both the SAM image encoder and mask decoder for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks show the effectiveness of the proposed framework.
comment: 10 pages, 9 figures
♻ ☆ HAIR: Hypernetworks-based All-in-One Image Restoration
Image restoration aims to recover a high-quality clean image from its degraded version. Recent progress in image restoration has demonstrated the effectiveness of All-in-One image restoration models in addressing various degradations simultaneously. However, these existing methods typically utilize the same parameters to tackle images with different degradation types, thus forcing the model to balance the performance between different tasks and limiting its performance on each task. To alleviate this issue, we propose HAIR, a \textbf{H}ypernetworks-based \textbf{A}ll-in-One \textbf{I}mage \textbf{R}estoration method that dynamically generates parameters based on input images. Specifically, HAIR consists of two main components, i.e., Classifier and Hyper Selecting Net (HSN). The Classifier is a simple image classification network used to generate a Global Information Vector (GIV) that contains the degradation information of the input image, and the HSN is a simple fully-connected neural network that receives the GIV and outputs parameters for the corresponding modules. Extensive experiments demonstrate that HAIR can significantly improve the performance of existing image restoration models in a plug-and-play manner, both in single-task and all-in-one settings. Notably, our innovative model, Res-HAIR, which integrates HAIR into the well-known Restormer, can obtain superior or comparable performance compared with current state-of-the-art methods. Moreover, we theoretically demonstrate that our proposed HAIR requires fewer parameters in contrast to the prevalent All-in-One methodologies. The code is available at \textcolor{blue}{\href{https://github.com/toummHus/HAIR}{https://github.com/toummHus/HAIR}.}
comment: 16 pages
♻ ☆ Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering ECCV 2024
Since the introduction of NeRFs, considerable attention has been focused on improving their training and inference times, leading to the development of Fast-NeRFs models. Despite demonstrating impressive rendering speed and quality, the rapid convergence of such models poses challenges for further improving reconstruction quality. Common strategies to improve rendering quality involves augmenting model parameters or increasing the number of sampled points. However, these computationally intensive approaches encounter limitations in achieving significant quality enhancements. This study introduces a model-agnostic framework inspired by Sparsely-Gated Mixture of Experts to enhance rendering quality without escalating computational complexity. Our approach enables specialization in rendering different scene components by employing a mixture of experts with varying resolutions. We present a novel gate formulation designed to maximize expert capabilities and propose a resolution-based routing technique to effectively induce sparsity and decompose scenes. Our work significantly improves reconstruction quality while maintaining competitive performance.
comment: The paper has been accepted to the ECCV 2024 conference
♻ ☆ Enhancing Quantitative Image Synthesis through Pretraining and Resolution Scaling for Bone Mineral Density Estimation from a Plain X-ray Image MICCAI
While most vision tasks are essentially visual in nature (for recognition), some important tasks, especially in the medical field, also require quantitative analysis (for quantification) using quantitative images. Unlike in visual analysis, pixel values in quantitative images correspond to physical metrics measured by specific devices (e.g., a depth image). However, recent work has shown that it is sometimes possible to synthesize accurate quantitative values from visual ones (e.g., depth from visual cues or defocus). This research aims to improve quantitative image synthesis (QIS) by exploring pretraining and image resolution scaling. We propose a benchmark for evaluating pretraining performance using the task of QIS-based bone mineral density (BMD) estimation from plain X-ray images, where the synthesized quantitative image is used to derive BMD. Our results show that appropriate pretraining can improve QIS performance, significantly raising the correlation of BMD estimation from 0.820 to 0.898, while others do not help or even hinder it. Scaling-up the resolution can further boost the correlation up to 0.923, a significant enhancement over conventional methods. Future work will include exploring more pretraining strategies and validating them on other image synthesis tasks.
comment: SASHIMI, 2024 (MICCAI workshop). 13 pages, 3 figures
♻ ☆ NOVUM: Neural Object Volumes for Robust Object Classification ECCV 2024
Discriminative models for object classification typically learn image-based representations that do not capture the compositional and 3D nature of objects. In this work, we show that explicitly integrating 3D compositional object representations into deep networks for image classification leads to a largely enhanced generalization in out-of-distribution scenarios. In particular, we introduce a novel architecture, referred to as NOVUM, that consists of a feature extractor and a neural object volume for every target object class. Each neural object volume is a composition of 3D Gaussians that emit feature vectors. This compositional object representation allows for a highly robust and fast estimation of the object class by independently matching the features of the 3D Gaussians of each category to features extracted from an input image. Additionally, the object pose can be estimated via inverse rendering of the corresponding neural object volume. To enable the classification of objects, the neural features at each 3D Gaussian are trained discriminatively to be distinct from (i) the features of 3D Gaussians in other categories, (ii) features of other 3D Gaussians of the same object, and (iii) the background features. Our experiments show that NOVUM offers intriguing advantages over standard architectures due to the 3D compositional structure of the object representation, namely: (1) An exceptional robustness across a spectrum of real-world and synthetic out-of-distribution shifts and (2) an enhanced human interpretability compared to standard models, all while maintaining real-time inference and a competitive accuracy on in-distribution data.
comment: 14 pages, 4 figures, accepted at ECCV 2024, code is accessible at https://github.com/GenIntel/NOVUM
♻ ☆ Brain3D: Generating 3D Objects from fMRI
Understanding the hidden mechanisms behind human's visual perception is a fundamental question in neuroscience. To that end, investigating into the neural responses of human mind activities, such as functional Magnetic Resonance Imaging (fMRI), has been a significant research vehicle. However, analyzing fMRI signals is challenging, costly, daunting, and demanding for professional training. Despite remarkable progress in fMRI analysis, existing approaches are limited to generating 2D images and far away from being biologically meaningful and practically useful. Under this insight, we propose to generate visually plausible and functionally more comprehensive 3D outputs decoded from brain signals, enabling more sophisticated modeling of fMRI data. Conceptually, we reformulate this task as a {\em fMRI conditioned 3D object generation} problem. We design a novel 3D object representation learning method, Brain3D, that takes as input the fMRI data of a subject who was presented with a 2D image, and yields as output the corresponding 3D object images. The key capabilities of this model include tackling the noises with high-level semantic signals and a two-stage architecture design for progressive high-level information integration. Extensive experiments validate the superior capability of our model over previous state-of-the-art 3D object generation methods. Importantly, we show that our model captures the distinct functionalities of each region of human vision system as well as their intricate interplay relationships, aligning remarkably with the established discoveries in neuroscience. Further, preliminary evaluations indicate that Brain3D can successfully identify the disordered brain regions in simulated scenarios, such as V1, V2, V3, V4, and the medial temporal lobe (MTL) within the human visual system. Our data and code will be available at https://brain-3d.github.io/.
comment: 20 pages, 11 figures, project page: https://brain-3d.github.io/
♻ ☆ DualAnoDiff: Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation
The performance of anomaly inspection in industrial manufacturing is constrained by the scarcity of anomaly data. To overcome this challenge, researchers have started employing anomaly generation approaches to augment the anomaly dataset. However, existing anomaly generation methods suffer from limited diversity in the generated anomalies and struggle to achieve a seamless blending of this anomaly with the original image. In this paper, we overcome these challenges from a new perspective, simultaneously generating a pair of the overall image and the corresponding anomaly part. We propose DualAnoDiff, a novel diffusion-based few-shot anomaly image generation model, which can generate diverse and realistic anomaly images by using a dual-interrelated diffusion model, where one of them is employed to generate the whole image while the other one generates the anomaly part. Moreover, we extract background and shape information to mitigate the distortion and blurriness phenomenon in few-shot image generation. Extensive experiments demonstrate the superiority of our proposed model over state-of-the-art methods in terms of both realism and diversity. Overall, our approach significantly improves the performance of downstream anomaly detection tasks, including anomaly detection, anomaly localization, and anomaly classification tasks.
comment: Code: https://github.com/yinyjin/DualAnoDiff
♻ ☆ Lightweight High-Speed Photography Built on Coded Exposure and Implicit Neural Representation of Videos
The demand for compact cameras capable of recording high-speed scenes with high resolution is steadily increasing. However, achieving such capabilities often entails high bandwidth requirements, resulting in bulky, heavy systems unsuitable for low-capacity platforms. To address this challenge, leveraging a coded exposure setup to encode a frame sequence into a blurry snapshot and subsequently retrieve the latent sharp video presents a lightweight solution. Nevertheless, restoring motion from blur remains a formidable challenge due to the inherent ill-posedness of motion blur decomposition, the intrinsic ambiguity in motion direction, and the diverse motions present in natural videos. In this study, we propose a novel approach to address these challenges by combining the classical coded exposure imaging technique with the emerging implicit neural representation for videos. We strategically embed motion direction cues into the blurry image during the imaging process. Additionally, we develop a novel implicit neural representation based blur decomposition network to sequentially extract the latent video frames from the blurry image, leveraging the embedded motion direction cues. To validate the effectiveness and efficiency of our proposed framework, we conduct extensive experiments using benchmark datasets and real-captured blurry images. The results demonstrate that our approach significantly outperforms existing methods in terms of both quality and flexibility. The code for our work is available at .https://github.com/zhihongz/BDINR
comment: Accepted by IJCV
♻ ☆ Structural Attention: Rethinking Transformer for Unpaired Medical Image Synthesis MICCAI
Unpaired medical image synthesis aims to provide complementary information for an accurate clinical diagnostics, and address challenges in obtaining aligned multi-modal medical scans. Transformer-based models excel in imaging translation tasks thanks to their ability to capture long-range dependencies. Although effective in supervised training settings, their performance falters in unpaired image synthesis, particularly in synthesizing structural details. This paper empirically demonstrates that, lacking strong inductive biases, Transformer can converge to non-optimal solutions in the absence of paired data. To address this, we introduce UNet Structured Transformer (UNest), a novel architecture incorporating structural inductive biases for unpaired medical image synthesis. We leverage the foundational Segment-Anything Model to precisely extract the foreground structure and perform structural attention within the main anatomy. This guides the model to learn key anatomical regions, thus improving structural synthesis under the lack of supervision in unpaired training. Evaluated on two public datasets, spanning three modalities, i.e., MR, CT, and PET, UNest improves recent methods by up to 19.30% across six medical image synthesis tasks. Our code is released at https://github.com/HieuPhan33/MICCAI2024-UNest.
comment: MICCAI version before camera ready
♻ ☆ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
♻ ☆ Classification Matters: Improving Video Action Detection with Class-Specific Attention ECCV 2024
Video action detection (VAD) aims to detect actors and classify their actions in a video. We figure that VAD suffers more from classification rather than localization of actors. Hence, we analyze how prevailing methods form features for classification and find that they prioritize actor regions, yet often overlooking the essential contextual information necessary for accurate classification. Accordingly, we propose to reduce the bias toward actor and encourage paying attention to the context that is relevant to each action class. By assigning a class-dedicated query to each action class, our model can dynamically determine where to focus for effective classification. The proposed model demonstrates superior performance on three challenging benchmarks with significantly fewer parameters and less computation.
comment: 31 pages, accepted to ECCV 2024 (oral)
♻ ☆ Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization
Image retrieval (IR) has emerged as a promising approach for self-localization in unmanned aerial vehicles (UAVs). However, IR-based methods face several challenges: 1) Pre- and post-processing incur significant computational and storage overhead; 2) The lack of interaction between dual-source features impairs precise spatial perception. In this paper, we propose an efficient heterogeneous spatial feature interaction method, termed Drone Referring Localization (DRL), which aims to localize UAV-view images within satellite imagery. Unlike conventional methods that treat different data sources in isolation, followed by cosine similarity computations, DRL facilitates the learnable interaction of heterogeneous features. To implement the proposed DRL, we design two transformer-based frameworks, Post-Fusion and Mix-Fusion, enabling end-to-end training and inference. Furthermore, we introduce random scale cropping and weight balance loss techniques to augment paired data and optimize the balance between positive and negative sample weights. Additionally, we construct a new dataset, UL14, and establish a benchmark tailored to the DRL framework. Compared to traditional IR methods, DRL achieves superior localization accuracy (MA@20 +9.4\%) while significantly reducing computational time (1/7) and storage overhead (1/3). The dataset and code will be made publicly available. The dataset and code are available at \url{https://github.com/Dmmm1997/DRL} .
comment: 15 pages, 14 figures
♻ ☆ MolNexTR: A Generalized Deep Learning Model for Molecular Image Recognition
In the field of chemical structure recognition, the task of converting molecular images into machine-readable data formats such as SMILES string stands as a significant challenge, primarily due to the varied drawing styles and conventions prevalent in chemical literature. To bridge this gap, we proposed MolNexTR, a novel image-to-graph deep learning model that collaborates to fuse the strengths of ConvNext, a powerful Convolutional Neural Network variant, and Vision-TRansformer. This integration facilitates a more detailed extraction of both local and global features from molecular images. MolNexTR can predict atoms and bonds simultaneously and understand their layout rules. It also excels at flexibly integrating symbolic chemistry principles to discern chirality and decipher abbreviated structures. We further incorporate a series of advanced algorithms, including an improved data augmentation module, an image contamination module, and a post-processing module for getting the final SMILES output. These modules cooperate to enhance the model's robustness to diverse styles of molecular images found in real literature. In our test sets, MolNexTR has demonstrated superior performance, achieving an accuracy rate of 81-97%, marking a significant advancement in the domain of molecular structure recognition.
♻ ☆ Phase Matching for Out-of-Distribution Generalization
The Fourier transform, an explicit decomposition method for visual signals, has been employed to explain the out-of-distribution generalization behaviors of Deep Neural Networks (DNNs). Previous studies indicate that the amplitude spectrum is susceptible to the disturbance caused by distribution shifts, whereas the phase spectrum preserves highly-structured spatial information that is crucial for robust visual representation learning. Inspired by this insight, this paper is dedicated to clarifying the relationships between Domain Generalization (DG) and the frequency components. Specifically, we provide distribution analysis and empirical experiments for the frequency components. Based on these observations, we propose a Phase Matching approach, termed PhaMa, to address DG problems. To this end, PhaMa introduces perturbations on the amplitude spectrum and establishes spatial relationships to match the phase components with patch contrastive learning. Experiments on multiple benchmarks demonstrate that our proposed method achieves state-of-the-art performance in domain generalization and out-of-distribution robustness tasks. Beyond vanilla analysis and experiments, we further clarify the relationships between the Fourier components and DG problems by introducing a Fourier-based Structural Causal Model (SCM).
♻ ☆ SGNet: Salient Geometric Network for Point Cloud Registration
Point Cloud Registration (PCR) is a critical and challenging task in computer vision. One of the primary difficulties in PCR is identifying salient and meaningful points that exhibit consistent semantic and geometric properties across different scans. Previous methods have encountered challenges with ambiguous matching due to the similarity among patch blocks throughout the entire point cloud and the lack of consideration for efficient global geometric consistency. To address these issues, we propose a new framework that includes several novel techniques. Firstly, we introduce a semantic-aware geometric encoder that combines object-level and patch-level semantic information. This encoder significantly improves registration recall by reducing ambiguity in patch-level superpoint matching. Additionally, we incorporate a prior knowledge approach that utilizes an intrinsic shape signature to identify salient points. This enables us to extract the most salient super points and meaningful dense points in the scene. Secondly, we introduce an innovative transformer that encodes High-Order (HO) geometric features. These features are crucial for identifying salient points within initial overlap regions while considering global high-order geometric consistency. To optimize this high-order transformer further, we introduce an anchor node selection strategy. By encoding inter-frame triangle or polyhedron consistency features based on these anchor nodes, we can effectively learn high-order geometric features of salient super points. These high-order features are then propagated to dense points and utilized by a Sinkhorn matching module to identify key correspondences for successful registration. In our experiments conducted on well-known datasets such as 3DMatch/3DLoMatch and KITTI, our approach has shown promising results, highlighting the effectiveness of our novel method.
♻ ☆ Fine-Grained Building Function Recognition from Street-View Images via Geometry-Aware Semi-Supervised Learning
In this work, we propose a geometry-aware semi-supervised method for fine-grained building function recognition. This method leverages the geometric relationships between multi-source data to improve the accuracy of pseudo labels in semi-supervised learning, extending the task's scope and making it applicable to cross-categorization systems of building function recognition. Firstly, we design an online semi-supervised pre-training stage, which facilitates the precise acquisition of building facade location information in street-view images. In the second stage, we propose a geometry-aware coarse annotation generation module. This module effectively combines GIS data and street-view data based on the geometric relationships, improving the accuracy of pseudo annotations. In the third stage, we combine the newly generated coarse annotations with the existing labeled dataset to achieve fine-grained functional recognition of buildings across multiple cities at a large scale. Extensive experiments demonstrate that our proposed framework exhibits superior performance in fine-grained functional recognition of buildings. Within the same categorization system, it achieves improvements of 7.6% and 4.8% compared to fully-supervised methods and state-of-the-art semi-supervised methods, respectively. Additionally, our method also performs well in cross-city tasks, i.e., extending the model trained on OmniCity (New York) to new areas (i.e., Los Angeles and Boston). This study provides a novel solution for the fine-grained function recognition of large-scale buildings across multiple cities, offering essential data for understanding urban infrastructure planning, human activity patterns, and the interactions between humans and buildings.
comment: This paper is currently under review
♻ ☆ Multi-weather Cross-view Geo-localization Using Denoising Diffusion Models ACM MM24
Cross-view geo-localization in GNSS-denied environments aims to determine an unknown location by matching drone-view images with the correct geo-tagged satellite-view images from a large gallery. Recent research shows that learning discriminative image representations under specific weather conditions can significantly enhance performance. However, the frequent occurrence of unseen extreme weather conditions hinders progress. This paper introduces MCGF, a Multi-weather Cross-view Geo-localization Framework designed to dynamically adapt to unseen weather conditions. MCGF establishes a joint optimization between image restoration and geo-localization using denoising diffusion models. For image restoration, MCGF incorporates a shared encoder and a lightweight restoration module to help the backbone eliminate weather-specific information. For geo-localization, MCGF uses EVA-02 as a backbone for feature extraction, with cross-entropy loss for training and cosine distance for testing. Extensive experiments on University160k-WX demonstrate that MCGF achieves competitive results for geo-localization in varying weather conditions.
comment: Accepted by ACM MM24 workshop
♻ ☆ VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities CIKM2024
Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data (e.g., images and videos) into symbols, have attracted attention as resources enabling knowledge processing and machine learning across modalities. However, the construction of MMKGs for videos consisting of multiple events, such as daily activities, is still in the early stages. In this paper, we construct an MMKG based on synchronized multi-view simulated videos of daily activities. Besides representing the content of daily life videos as event-centric knowledge, our MMKG also includes frame-by-frame fine-grained changes, such as bounding boxes within video frames. In addition, we provide support tools for querying our MMKG. As an application example, we demonstrate that our MMKG facilitates benchmarking vision-language models by providing the necessary vision-language datasets for a tailored task.
comment: 5 pages, 4 figures, accepted by CIKM2024 Resource Track
♻ ☆ Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models ECCV 2024
Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot video motion customization, we propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. To disentangle the spatial and temporal information during training, we introduce a novel concept of appearance absorbers that detach the original appearance from the reference video prior to motion learning. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks such as custom video generation and editing, video appearance customization and multiple motion combination. Our project page can be found at https://customize-a-video.github.io.
comment: Accepted by ECCV 2024. Project page: https://customize-a-video.github.io
♻ ☆ AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans IROS
Recently, progress in acquisition equipment such as LiDAR sensors has enabled sensing increasingly spacious outdoor 3D environments. Making sense of such 3D acquisitions requires fine-grained scene understanding, such as constructing instance-based 3D scene segmentations. Commonly, a neural network is trained for this task; however, this requires access to a large, densely annotated dataset, which is widely known to be challenging to obtain. To address this issue, in this work we propose to predict instance segmentations for 3D scenes in an unsupervised way, without relying on ground-truth annotations. To this end, we construct a learning framework consisting of two components: (1) a pseudo-annotation scheme for generating initial unsupervised pseudo-labels; and (2) a self-training algorithm for instance segmentation to fit robust, accurate instances from initial noisy proposals. To enable generating 3D instance mask proposals, we construct a weighted proxy-graph by connecting 3D points with edges integrating multi-modal image- and point-based self-supervised features, and perform graph-cuts to isolate individual pseudo-instances. We then build on a state-of-the-art point-based architecture and train a 3D instance segmentation model, resulting in significant refinement of initial proposals. To scale to arbitrary complexity 3D scenes, we design our algorithm to operate on local 3D point chunks and construct a merging step to generate scene-level instance segmentations. Experiments on the challenging SemanticKITTI benchmark demonstrate the potential of our approach, where it attains 13.3% higher Average Precision and 9.1% higher F1 score compared to the best-performing baseline. The code will be made publicly available at https://github.com/artonson/autoinst.
comment: 8 pages, 7 figures, to be published in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024
♻ ☆ AnomalousPatchCore: Exploring the Use of Anomalous Samples in Industrial Anomaly Detection ECCV
Visual inspection, or industrial anomaly detection, is one of the most common quality control types in manufacturing. The task is to identify the presence of an anomaly given an image, e.g., a missing component on an image of a circuit board, for subsequent manual inspection. While industrial anomaly detection has seen a surge in recent years, most anomaly detection methods still utilize knowledge only from normal samples, failing to leverage the information from the frequently available anomalous samples. Additionally, they heavily rely on very general feature extractors pre-trained on common image classification datasets. In this paper, we address these shortcomings and propose the new anomaly detection system AnomalousPatchCore~(APC) based on a feature extractor fine-tuned with normal and anomalous in-domain samples and a subsequent memory bank for identifying unusual features. To fine-tune the feature extractor in APC, we propose three auxiliary tasks that address the different aspects of anomaly detection~(classification vs. localization) and mitigate the effect of the imbalance between normal and anomalous samples. Our extensive evaluation on the MVTec dataset shows that APC outperforms state-of-the-art systems in detecting anomalies, which is especially important in industrial anomaly detection given the subsequent manual inspection. In detailed ablation studies, we further investigate the properties of our APC.
comment: Accepted at the 2nd workshop on Vision-based InduStrial InspectiON (VISION) @ ECCV
♻ ☆ MMASD+: A Novel Dataset for Privacy-Preserving Behavior Analysis of Children with Autism Spectrum Disorder
Autism spectrum disorder (ASD) is characterized by significant challenges in social interaction and comprehending communication signals. Recently, therapeutic interventions for ASD have increasingly utilized Deep learning powered-computer vision techniques to monitor individual progress over time. These models are trained on private, non-public datasets from the autism community, creating challenges in comparing results across different models due to privacy-preserving data-sharing issues. This work introduces MMASD+, an enhanced version of the novel open-source dataset called Multimodal ASD (MMASD). MMASD+ consists of diverse data modalities, including 3D-Skeleton, 3D Body Mesh, and Optical Flow data. It integrates the capabilities of Yolov8 and Deep SORT algorithms to distinguish between the therapist and children, addressing a significant barrier in the original dataset. Additionally, a Multimodal Transformer framework is proposed to predict 11 action types and the presence of ASD. This framework achieves an accuracy of 95.03% for predicting action types and 96.42% for predicting ASD presence, demonstrating over a 10% improvement compared to models trained on single data modalities. These findings highlight the advantages of integrating multiple data modalities within the Multimodal Transformer framework.
♻ ☆ Field-of-View Extension for Brain Diffusion MRI via Deep Generative Models
Purpose: In diffusion MRI (dMRI), the volumetric and bundle analyses of whole-brain tissue microstructure and connectivity can be severely impeded by an incomplete field-of-view (FOV). This work aims to develop a method for imputing the missing slices directly from existing dMRI scans with an incomplete FOV. We hypothesize that the imputed image with complete FOV can improve the whole-brain tractography for corrupted data with incomplete FOV. Therefore, our approach provides a desirable alternative to discarding the valuable dMRI data, enabling subsequent tractography analyses that would otherwise be challenging or unattainable with corrupted data. Approach: We propose a framework based on a deep generative model that estimates the absent brain regions in dMRI scans with incomplete FOV. The model is capable of learning both the diffusion characteristics in diffusion-weighted images (DWI) and the anatomical features evident in the corresponding structural images for efficiently imputing missing slices of DWI outside of incomplete FOV. Results: For evaluating the imputed slices, on the WRAP dataset the proposed framework achieved PSNRb0=22.397, SSIMb0=0.905, PSNRb1300=22.479, SSIMb1300=0.893; on the NACC dataset it achieved PSNRb0=21.304, SSIMb0=0.892, PSNRb1300=21.599, SSIMb1300= 0.877. The proposed framework improved the tractography accuracy, as demonstrated by an increased average Dice score for 72 tracts (p < 0.001) on both the WRAP and NACC datasets. Conclusions: Results suggest that the proposed framework achieved sufficient imputation performance in dMRI data with incomplete FOV for improving whole-brain tractography, thereby repairing the corrupted data. Our approach achieved more accurate whole-brain tractography results with extended and complete FOV and reduced the uncertainty when analyzing bundles associated with Alzheimer's Disease.
comment: 20 pages, 11 figures
♻ ☆ Biomedical Image Segmentation: A Systematic Literature Review of Deep Learning Based Object Detection Methods
Biomedical image segmentation plays a vital role in diagnosis of diseases across various organs. Deep learning-based object detection methods are commonly used for such segmentation. There exists an extensive research in this topic. However, there is no standard review on this topic. Existing surveys often lack a standardized approach or focus on broader segmentation techniques. In this paper, we conducted a systematic literature review (SLR), collected and analysed 148 articles that explore deep learning object detection methods for biomedical image segmentation. We critically analyzed these methods, identified the key challenges, and discussed the future directions. From the selected articles we extracted the results including the deep learning models, targeted imaging modalities, targeted diseases, and the metrics for the analysis of the methods. The results have been presented in tabular and/or charted forms. The results are presented in three major categories including two stage detection models, one stage detection models and point-based detection models. Each article is individually analyzed along with its pros and cons. Finally, we discuss open challenges, potential benefits, and future research directions. This SLR aims to provide the research community with a quick yet deeper understanding of these segmentation models, ultimately facilitating the development of more powerful solutions for biomedical image analysis.
♻ ☆ PhysPart: Physically Plausible Part Completion for Interactable Objects
Interactable objects are ubiquitous in our daily lives. Recent advances in 3D generative models make it possible to automate the modeling of these objects, benefiting a range of applications from 3D printing to the creation of robot simulation environments. However, while significant progress has been made in modeling 3D shapes and appearances, modeling object physics, particularly for interactable objects, remains challenging due to the physical constraints imposed by inter-part motions. In this paper, we tackle the problem of physically plausible part completion for interactable objects, aiming to generate 3D parts that not only fit precisely into the object but also allow smooth part motions. To this end, we propose a diffusion-based part generation model that utilizes geometric conditioning through classifier-free guidance and formulates physical constraints as a set of stability and mobility losses to guide the sampling process. Additionally, we demonstrate the generation of dependent parts, paving the way toward sequential part generation for objects with complex part-whole hierarchies. Experimentally, we introduce a new metric for measuring physical plausibility based on motion success rates. Our model outperforms existing baselines over shape and physical metrics, especially those that do not adequately model physical constraints. We also demonstrate our applications in 3D printing, robot manipulation, and sequential part generation, showing our strength in realistic tasks with the demand for high physical plausibility.
♻ ☆ SurGen: Text-Guided Diffusion Model for Surgical Video Generation
Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis, producing the highest resolution and longest duration videos among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
♻ ☆ Interpretable Image Emotion Recognition: A Domain Adaptation Approach Using Facial Expressions
This paper proposes a feature-based domain adaptation technique for identifying emotions in generic images, encompassing both facial and non-facial objects, as well as non-human components. This approach addresses the challenge of the limited availability of pre-trained models and well-annotated datasets for Image Emotion Recognition (IER). Initially, a deep-learning-based Facial Expression Recognition (FER) system is developed, classifying facial images into discrete emotion classes. Maintaining the same network architecture, this FER system is then adapted to recognize emotions in generic images through the application of discrepancy loss, enabling the model to effectively learn IER features while classifying emotions into categories such as 'happy,' 'sad,' 'hate,' and 'anger.' Additionally, a novel interpretability method, Divide and Conquer based Shap (DnCShap), is introduced to elucidate the visual features most relevant for emotion recognition. The proposed IER system demonstrated emotion classification accuracies of 60.98% for the IAPSa dataset, 58.86% for the ArtPhoto dataset, 69.13% for the FI dataset, and 58.06% for the EMOTIC dataset. The system effectively identifies the important visual features leading to specific emotion classifications and provides detailed embedding plots to explain the predictions, enhancing the understanding and trust in AI-driven emotion recognition systems.
Information Retrieval
☆ Modeling and Analyzing the Influence of Non-Item Pages on Sequential Next-Item Prediction
Analyzing the sequence of historical interactions between users and items, sequential recommendation models learn user intent and make predictions about the next item of interest. Next to these item interactions, most systems also have interactions with pages not related to specific items, for example navigation pages, account pages, and pages for a specific category, which may provide additional insights into the user's interests. However, while there are several approaches to integrate additional information about items and users, the topic of integrating non-item pages has been less explored. We use the hypotheses testing framework HypTrails to show that there is indeed a relationship between these non-item pages and the items of interest and fill this gap by proposing various approaches of representing non-item pages (e.g, based on their content) to use them as an additional information source for the task of sequential next-item prediction. We create a synthetic dataset with non-item pages highly related to the subsequent item to show that the models are generally capable of learning from these interactions, and subsequently evaluate the improvements gained by including non-item pages in two real-world datasets. We adapt eight popular sequential recommender models, covering CNN-, RNN- and transformer-based architectures, to integrate non-item pages and investigate the capabilities of these models to leverage their information for next item prediction. We also analyze their behavior on noisy data and compare different item representation strategies. Our results show that non-item pages are a valuable source of information, but representing such a page well is the key to successfully leverage them. The inclusion of non-item pages can increase the performance for next-item prediction in all examined model architectures with a varying degree.
comment: 36 pages, 19 figures; Work in Progress
☆ Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature
The exponential growth of scientific literature necessitates advanced tools for effective knowledge exploration. We present Knowledge Navigator, a system designed to enhance exploratory search abilities by organizing and structuring the retrieved documents from broad topical queries into a navigable, two-level hierarchy of named and descriptive scientific topics and subtopics. This structured organization provides an overall view of the research themes in a domain, while also enabling iterative search and deeper knowledge discovery within specific subtopics by allowing users to refine their focus and retrieve additional relevant documents. Knowledge Navigator combines LLM capabilities with cluster-based methods to enable an effective browsing method. We demonstrate our approach's effectiveness through automatic and manual evaluations on two novel benchmarks, CLUSTREC-COVID and SCITOC. Our code, prompts, and benchmarks are made publicly available.
☆ Evaluating Named Entity Recognition Using Few-Shot Prompting with Large Language Models
This paper evaluates Few-Shot Prompting with Large Language Models for Named Entity Recognition (NER). Traditional NER systems rely on extensive labeled datasets, which are costly and time-consuming to obtain. Few-Shot Prompting or in-context learning enables models to recognize entities with minimal examples. We assess state-of-the-art models like GPT-4 in NER tasks, comparing their few-shot performance to fully supervised benchmarks. Results show that while there is a performance gap, large models excel in adapting to new entity types and domains with very limited data. We also explore the effects of prompt engineering, guided output format and context length on performance. This study underscores Few-Shot Learning's potential to reduce the need for large labeled datasets, enhancing NER scalability and accessibility.
comment: Github repo: https://github.com/GEODE-project/ner-llm
☆ Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions
Virtual counselors powered by large language models (LLMs) aim to create interactive support systems that effectively assist clients struggling with mental health challenges. To replicate counselor-client conversations, researchers have built an online mental health platform that allows professional counselors to provide clients with text-based counseling services for about an hour per session. Notwithstanding its effectiveness, challenges exist as human annotation is time-consuming, cost-intensive, privacy-protected, and not scalable. To address this issue and investigate the applicability of LLMs in psychological counseling conversation simulation, we propose a framework that employs two LLMs via role-playing for simulating counselor-client interactions. Our framework involves two LLMs, one acting as a client equipped with a specific and real-life user profile and the other playing the role of an experienced counselor, generating professional responses using integrative therapy techniques. We implement both the counselor and the client by zero-shot prompting the GPT-4 model. In order to assess the effectiveness of LLMs in simulating counselor-client interactions and understand the disparities between LLM- and human-generated conversations, we evaluate the synthetic data from various perspectives. We begin by assessing the client's performance through automatic evaluations. Next, we analyze and compare the disparities between dialogues generated by the LLM and those generated by professional counselors. Furthermore, we conduct extensive experiments to thoroughly examine the performance of our LLM-based counselor trained with synthetic interactive dialogues by benchmarking against state-of-the-art models for mental health.
☆ PDSR: A Privacy-Preserving Diversified Service Recommendation Method on Distributed Data
The last decade has witnessed a tremendous growth of service computing, while efficient service recommendation methods are desired to recommend high-quality services to users. It is well known that collaborative filtering is one of the most popular methods for service recommendation based on QoS, and many existing proposals focus on improving recommendation accuracy, i.e., recommending high-quality redundant services. Nevertheless, users may have different requirements on QoS, and hence diversified recommendation has been attracting increasing attention in recent years to fulfill users' diverse demands and to explore potential services. Unfortunately, the recommendation performances relies on a large volume of data (e.g., QoS data), whereas the data may be distributed across multiple platforms. Therefore, to enable data sharing across the different platforms for diversified service recommendation, we propose a Privacy-preserving Diversified Service Recommendation (PDSR) method. Specifically, we innovate in leveraging the Locality-Sensitive Hashing (LSH) mechanism such that privacy-preserved data sharing across different platforms is enabled to construct a service similarity graph. Based on the similarity graph, we propose a novel accuracy-diversity metric and design a $2$-approximation algorithm to select $K$ services to recommend by maximizing the accuracy-diversity measure. Extensive experiments on real datasets are conducted to verify the efficacy of our PDSR method.
☆ CAPER: Enhancing Career Trajectory Prediction using Temporal Knowledge Graph and Ternary Relationship
The problem of career trajectory prediction (CTP) aims to predict one's future employer or job position. While several CTP methods have been developed for this problem, we posit that none of these methods (1) jointly considers the mutual ternary dependency between three key units (i.e., user, position, and company) of a career and (2) captures the characteristic shifts of key units in career over time, leading to an inaccurate understanding of the job movement patterns in the labor market. To address the above challenges, we propose a novel solution, named as CAPER, that solves the challenges via sophisticated temporal knowledge graph (TKG) modeling. It enables the utilization of a graph-structured knowledge base with rich expressiveness, effectively preserving the changes in job movement patterns. Furthermore, we devise an extrapolated career reasoning task on TKG for a realistic evaluation. The experiments on a real-world career trajectory dataset demonstrate that CAPER consistently and significantly outperforms four baselines, two recent TKG reasoning methods, and five state-of-the-art CTP methods in predicting one's future companies and positions-i.e., on average, yielding 6.80% and 34.58% more accurate predictions, respectively.
☆ Lyrically Speaking: Exploring the Link Between Lyrical Emotions, Themes and Depression Risk
Lyrics play a crucial role in affecting and reinforcing emotional states by providing meaning and emotional connotations that interact with the acoustic properties of the music. Specific lyrical themes and emotions may intensify existing negative states in listeners and may lead to undesirable outcomes, especially in listeners with mood disorders such as depression. Hence, it is important for such individuals to be mindful of their listening strategies. In this study, we examine online music consumption of individuals at risk of depression in light of lyrical themes and emotions. Lyrics obtained from the listening histories of 541 Last.fm users, divided into At-Risk and No-Risk based on their mental well-being scores, were analyzed using natural language processing techniques. Statistical analyses of the results revealed that individuals at risk for depression prefer songs with lyrics associated with low valence and low arousal. Additionally, lyrics associated with themes of denial, self-reference, and ambivalence were preferred. In contrast, themes such as liberation, familiarity, and activity are not as favored. This study opens up the possibility of an approach to assessing depression risk from the digital footprint of individuals and potentially developing personalized recommendation systems.
comment: Accepted at the 25th International Society for Music Information Retrieval Conference (ISMIR) 2024, San Francisco, United States
☆ Efficient $k$-NN Search in IoT Data: Overlap Optimization in Tree-Based Indexing Structures
The proliferation of interconnected devices in the Internet of Things (IoT) has led to an exponential increase in data, commonly known as Big IoT Data. Efficient retrieval of this heterogeneous data demands a robust indexing mechanism for effective organization. However, a significant challenge remains: the overlap in data space partitions during index construction. This overlap increases node access during search and retrieval, resulting in higher resource consumption, performance bottlenecks, and impedes system scalability. To address this issue, we propose three innovative heuristics designed to quantify and strategically reduce data space partition overlap. The volume-based method (VBM) offers a detailed assessment by calculating the intersection volume between partitions, providing deeper insights into spatial relationships. The distance-based method (DBM) enhances efficiency by using the distance between partition centers and radii to evaluate overlap, offering a streamlined yet accurate approach. Finally, the object-based method (OBM) provides a practical solution by counting objects across multiple partitions, delivering an intuitive understanding of data space dynamics. Experimental results demonstrate the effectiveness of these methods in reducing search time, underscoring their potential to improve data space partitioning and enhance overall system performance.
comment: 28 pages, 21 figures, 1 table
☆ An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders
Recent advancements in large language models (LLMs) have enabled understanding webpage contexts, product details, and human instructions. Utilizing LLMs as the foundational architecture for either reward models or policies in reinforcement learning has gained popularity -- a notable achievement is the success of InstructGPT. RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals in industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases. In this project, several RL methods are implemented and evaluated using the WebShop benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product. The RL agents are developed by fine-tuning a pre-trained BERT model with various objectives, learning from preferences without a reward model, and employing contemporary training techniques such as Proximal Policy Optimization (PPO) as used in InstructGPT, and Direct Preference Optimization (DPO). This report also evaluates the RL agents trained using generative trajectories. Evaluations were conducted using Thompson sampling in the WebShop simulator environment. The simulated online experiments demonstrate that agents trained on generated trajectories exhibited comparable task performance to those trained using human trajectories. This has demonstrated an example of an extremely low-cost data-efficient way of training reinforcement learning agents. Also, with limited training time (<2hours), without utilizing any images, a DPO agent achieved a 19% success rate after approximately 3000 steps or 30 minutes of training on T4 GPUs, compared to a PPO agent, which reached a 15% success rate.
♻ ☆ Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications PRICAI 2024
Contextual bandits serve as a fundamental algorithmic framework for optimizing recommendation decisions online. Though extensive attention has been paid to tailoring contextual bandits for recommendation applications, the "herding effects" in user feedback have been ignored. These herding effects bias user feedback toward historical ratings, breaking down the assumption of unbiased feedback inherent in contextual bandits. This paper develops a novel variant of the contextual bandit that is tailored to address the feedback bias caused by the herding effects. A user feedback model is formulated to capture this feedback bias. We design the TS-Conf (Thompson Sampling under Conformity) algorithm, which employs posterior sampling to balance the exploration and exploitation tradeoff. We prove an upper bound for the regret of the algorithm, revealing the impact of herding effects on learning speed. Extensive experiments on datasets demonstrate that TS-Conf outperforms four benchmark algorithms. Analysis reveals that TS-Conf effectively mitigates the negative impact of herding effects, resulting in faster learning and improved recommendation accuracy.
comment: Published as a conference paper at PRICAI 2024
♻ ☆ PASH at TREC 2021 Deep Learning Track: Generative Enhanced Model for Multi-stage Ranking
This paper describes the PASH participation in TREC 2021 Deep Learning Track. In the recall stage, we adopt a scheme combining sparse and dense retrieval method. In the multi-stage ranking phase, point-wise and pair-wise ranking strategies are used one after another based on model continual pre-trained on general knowledge and document-level data. Compared to TREC 2020 Deep Learning Track, we have additionally introduced the generative model T5 to further enhance the performance.
comment: TREC 2021
♻ ☆ WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs KDD
Large Language Models (LLMs) have greatly contributed to the development of adaptive intelligent agents and are positioned as an important way to achieve Artificial General Intelligence (AGI). However, LLMs are prone to produce factually incorrect information and often produce "phantom" content that undermines their reliability, which poses a serious challenge for their deployment in real-world scenarios. Enhancing LLMs by combining external databases and information retrieval mechanisms is an effective path. To address the above challenges, we propose a new approach called WeKnow-RAG, which integrates Web search and Knowledge Graphs into a "Retrieval-Augmented Generation (RAG)" system. First, the accuracy and reliability of LLM responses are improved by combining the structured representation of Knowledge Graphs with the flexibility of dense vector retrieval. WeKnow-RAG then utilizes domain-specific knowledge graphs to satisfy a variety of queries and domains, thereby improving performance on factual information and complex reasoning tasks by employing multi-stage web page retrieval techniques using both sparse and dense retrieval methods. Our approach effectively balances the efficiency and accuracy of information retrieval, thus improving the overall retrieval process. Finally, we also integrate a self-assessment mechanism for the LLM to evaluate the trustworthiness of the answers it generates. Our approach proves its outstanding effectiveness in a wide range of offline experiments and online submissions.
comment: 8 pages, 2 figures, technical report for 3rd place in Task 3 of Meta KDD Cup 2024 CRAG Challenge
♻ ☆ From Data Creator to Data Reuser: Distance Matters
Sharing research data is necessary, but not sufficient, for data reuse. Open science policies focus more heavily on data sharing than on reuse, yet both are complex, labor-intensive, expensive, and require infrastructure investments by multiple stakeholders. The value of data reuse lies in relationships between creators and reusers. By addressing knowledge exchange, rather than mere transactions between stakeholders, investments in data management and knowledge infrastructures can be made more wisely. Drawing upon empirical studies of data sharing and reuse, we develop the theoretical construct of distance between data creator and data reuser, identifying six distance dimensions that influence the ability to transfer knowledge effectively: domain, methods, collaboration, curation, purposes, and time and temporality. We address the social and socio-technical aspects of these dimensions, exploring ways in which they may decrease -- or increase -- distances between creators and reusers. Our theoretical framing of the distance between data creators and prospective reusers leads to recommendations to four categories of stakeholders on how to make data sharing and reuse more effective: data creators, data reusers, data archivists, and funding agencies. 'It takes a village' to share research data -- and a village to reuse data. Our aim is to provoke new research questions, new research, and new investments in effective and efficient circulation of research data; and to identify criteria for investments at each stage of data and research life cycles.
comment: 74 pages, double-spaced, consisting of Table of Contents, Abstract, 45 page narrative, 1 box, 1 figure, 1 table, 27 pages references. Original work
Machine Learning
☆ Q-MRS: A Deep Learning Framework for Quantitative Magnetic Resonance Spectra Analysis
Magnetic resonance spectroscopy (MRS) is an established technique for studying tissue metabolism, particularly in central nervous system disorders. While powerful and versatile, MRS is often limited by challenges associated with data quality, processing, and quantification. Existing MRS quantification methods face difficulties in balancing model complexity and reproducibility during spectral modeling, often falling into the trap of either oversimplification or over-parameterization. To address these limitations, this study introduces a deep learning (DL) framework that employs transfer learning, in which the model is pre-trained on simulated datasets before it undergoes fine-tuning on in vivo data. The proposed framework showed promising performance when applied to the Philips dataset from the BIG GABA repository and represents an exciting advancement in MRS data analysis.
comment: 8 pages, 4 figures, and 3 tables for the main body; 9 pages, 4 figures, and 3 tables for the supplementary material
☆ Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: https://github.com/NVlabs/Eagle
comment: Github: https://github.com/NVlabs/Eagle, HuggingFace: https://huggingface.co/NVEagle
☆ Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need
Time series forecasting requires balancing short-term and long-term dependencies for accurate predictions. Existing methods mainly focus on long-term dependency modeling, neglecting the complexities of short-term dynamics, which may hinder performance. Transformers are superior in modeling long-term dependencies but are criticized for their quadratic computational cost. Mamba provides a near-linear alternative but is reported less effective in time series longterm forecasting due to potential information loss. Current architectures fall short in offering both high efficiency and strong performance for long-term dependency modeling. To address these challenges, we introduce Mixture of Universals (MoU), a versatile model to capture both short-term and long-term dependencies for enhancing performance in time series forecasting. MoU is composed of two novel designs: Mixture of Feature Extractors (MoF), an adaptive method designed to improve time series patch representations for short-term dependency, and Mixture of Architectures (MoA), which hierarchically integrates Mamba, FeedForward, Convolution, and Self-Attention architectures in a specialized order to model long-term dependency from a hybrid perspective. The proposed approach achieves state-of-the-art performance while maintaining relatively low computational costs. Extensive experiments on seven real-world datasets demonstrate the superiority of MoU. Code is available at https://github.com/lunaaa95/mou/.
comment: Code at https://github.com/lunaaa95/mou/
☆ ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
Detecting and attributing temperature increases due to climate change is crucial for understanding global warming and guiding adaptation strategies. The complexity of distinguishing human-induced climate signals from natural variability has challenged traditional detection and attribution (D&A) approaches, which seek to identify specific "fingerprints" in climate response variables. Deep learning offers potential for discerning these complex patterns in expansive spatial datasets. However, lack of standard protocols has hindered consistent comparisons across studies. We introduce ClimDetect, a standardized dataset of over 816k daily climate snapshots, designed to enhance model accuracy in identifying climate change signals. ClimDetect integrates various input and target variables used in past research, ensuring comparability and consistency. We also explore the application of vision transformers (ViT) to climate data, a novel and modernizing approach in this context. Our open-access data and code serve as a benchmark for advancing climate science through improved model evaluations. ClimDetect is publicly accessible via Huggingface dataet respository at: https://huggingface.co/datasets/ClimDetect/ClimDetect.
☆ CoGen: Learning from Feedback with Coupled Comprehension and Generation
Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system's language, making it significantly more human-like.
comment: 17 pages, 9 figures
☆ Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems
We examine stability properties of primal-dual gradient flow dynamics for composite convex optimization problems with multiple, possibly nonsmooth, terms in the objective function under the generalized consensus constraint. The proposed dynamics are based on the proximal augmented Lagrangian and they provide a viable alternative to ADMM which faces significant challenges from both analysis and implementation viewpoints in large-scale multi-block scenarios. In contrast to customized algorithms with individualized convergence guarantees, we provide a systematic approach for solving a broad class of challenging composite optimization problems. We leverage various structural properties to establish global (exponential) convergence guarantees for the proposed dynamics. Our assumptions are much weaker than those required to prove (exponential) stability of various primal-dual dynamics as well as (linear) convergence of discrete-time methods, e.g., standard two-block and multi-block ADMM and EXTRA algorithms. Finally, we show necessity of some of our structural assumptions for exponential stability and provide computational experiments to demonstrate the convenience of the proposed dynamics for parallel and distributed computing applications.
comment: 31 pages; 4 figures
☆ Efficient Slice Anomaly Detection Network for 3D Brain MRI Volume
Current anomaly detection methods excel with benchmark industrial data but struggle with natural images and medical data due to varying definitions of 'normal' and 'abnormal.' This makes accurate identification of deviations in these fields particularly challenging. Especially for 3D brain MRI data, all the state-of-the-art models are reconstruction-based with 3D convolutional neural networks which are memory-intensive, time-consuming and producing noisy outputs that require further post-processing. We propose a framework called Simple Slice-based Network (SimpleSliceNet), which utilizes a model pre-trained on ImageNet and fine-tuned on a separate MRI dataset as a 2D slice feature extractor to reduce computational cost. We aggregate the extracted features to perform anomaly detection tasks on 3D brain MRI volumes. Our model integrates a conditional normalizing flow to calculate log likelihood of features and employs the Semi-Push-Pull Mechanism to enhance anomaly detection accuracy. The results indicate improved performance, showcasing our model's remarkable adaptability and effectiveness when addressing the challenges exists in brain MRI data. In addition, for the large-scale 3D brain volumes, our model SimpleSliceNet outperforms the state-of-the-art 2D and 3D models in terms of accuracy, memory usage and time consumption. Code is available at: https://anonymous.4open.science/r/SimpleSliceNet-8EA3.
comment: 15 pages, 5 figures
☆ Generating Binary Species Range Maps
Accurately predicting the geographic ranges of species is crucial for assisting conservation efforts. Traditionally, range maps were manually created by experts. However, species distribution models (SDMs) and, more recently, deep learning-based variants offer a potential automated alternative. Deep learning-based SDMs generate a continuous probability representing the predicted presence of a species at a given location, which must be binarized by setting per-species thresholds to obtain binary range maps. However, selecting appropriate per-species thresholds to binarize these predictions is non-trivial as different species can require distinct thresholds. In this work, we evaluate different approaches for automatically identifying the best thresholds for binarizing range maps using presence-only data. This includes approaches that require the generation of additional pseudo-absence data, along with ones that only require presence data. We also propose an extension of an existing presence-only technique that is more robust to outliers. We perform a detailed evaluation of different thresholding techniques on the tasks of binary range estimation and large-scale fine-grained visual classification, and we demonstrate improved performance over existing pseudo-absence free approaches using our method.
☆ Modeling and Analyzing the Influence of Non-Item Pages on Sequential Next-Item Prediction
Analyzing the sequence of historical interactions between users and items, sequential recommendation models learn user intent and make predictions about the next item of interest. Next to these item interactions, most systems also have interactions with pages not related to specific items, for example navigation pages, account pages, and pages for a specific category, which may provide additional insights into the user's interests. However, while there are several approaches to integrate additional information about items and users, the topic of integrating non-item pages has been less explored. We use the hypotheses testing framework HypTrails to show that there is indeed a relationship between these non-item pages and the items of interest and fill this gap by proposing various approaches of representing non-item pages (e.g, based on their content) to use them as an additional information source for the task of sequential next-item prediction. We create a synthetic dataset with non-item pages highly related to the subsequent item to show that the models are generally capable of learning from these interactions, and subsequently evaluate the improvements gained by including non-item pages in two real-world datasets. We adapt eight popular sequential recommender models, covering CNN-, RNN- and transformer-based architectures, to integrate non-item pages and investigate the capabilities of these models to leverage their information for next item prediction. We also analyze their behavior on noisy data and compare different item representation strategies. Our results show that non-item pages are a valuable source of information, but representing such a page well is the key to successfully leverage them. The inclusion of non-item pages can increase the performance for next-item prediction in all examined model architectures with a varying degree.
comment: 36 pages, 19 figures; Work in Progress
☆ Sigma Flows for Image and Data Labeling and Learning Structured Prediction
This paper introduces the sigma flow model for the prediction of structured labelings of data observed on Riemannian manifolds, including Euclidean image domains as special case. The approach combines the Laplace-Beltrami framework for image denoising and enhancement, introduced by Sochen, Kimmel and Malladi about 25 years ago, and the assignment flow approach introduced and studied by the authors. The sigma flow arises as Riemannian gradient flow of generalized harmonic energies and thus is governed by a nonlinear geometric PDE which determines a harmonic map from a closed Riemannian domain manifold to a statistical manifold, equipped with the Fisher-Rao metric from information geometry. A specific ingredient of the sigma flow is the mutual dependency of the Riemannian metric of the domain manifold on the evolving state. This makes the approach amenable to machine learning in a specific way, by realizing this dependency through a mapping with compact time-variant parametrization that can be learned from data. Proof of concept experiments demonstrate the expressivity of the sigma flow model and prediction performance. Structural similarities to transformer network architectures and networks generated by the geometric integration of sigma flows are pointed out, which highlights the connection to deep learning and, conversely, may stimulate the use of geometric design principles for structured prediction in other areas of scientific machine learning.
comment: 51 pages
☆ Generalized Naive Bayes
In this paper we introduce the so-called Generalized Naive Bayes structure as an extension of the Naive Bayes structure. We give a new greedy algorithm that finds a good fitting Generalized Naive Bayes (GNB) probability distribution. We prove that this fits the data at least as well as the probability distribution determined by the classical Naive Bayes (NB). Then, under a not very restrictive condition, we give a second algorithm for which we can prove that it finds the optimal GNB probability distribution, i.e. best fitting structure in the sense of KL divergence. Both algorithms are constructed to maximize the information content and aim to minimize redundancy. Based on these algorithms, new methods for feature selection are introduced. We discuss the similarities and differences to other related algorithms in terms of structure, methodology, and complexity. Experimental results show, that the algorithms introduced outperform the related algorithms in many cases.
comment: 44 pages, 19 figures
☆ Multi-modal Adversarial Training for Zero-Shot Voice Cloning INTERSPEECH 2024
A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build off of recent works which have used Generative Advsarial Networks (GAN) by proposing a Transformer encoder-decoder architecture to conditionally discriminates between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model achieves improvements over the baseline in terms of speech quality and speaker similarity. Audio examples from our system are available online.
comment: Accepted at INTERSPEECH 2024
☆ MetaGFN: Exploring Distant Modes with Adapted Metadynamics for Continuous GFlowNets
Generative Flow Networks (GFlowNets) are a class of generative models that sample objects in proportion to a specified reward function through a learned policy. They can be trained either on-policy or off-policy, needing a balance between exploration and exploitation for fast convergence to a target distribution. While exploration strategies for discrete GFlowNets have been studied, exploration in the continuous case remains to be investigated, despite the potential for novel exploration algorithms due to the local connectedness of continuous domains. Here, we introduce Adapted Metadynamics, a variant of metadynamics that can be applied to arbitrary black-box reward functions on continuous domains. We use Adapted Metadynamics as an exploration strategy for continuous GFlowNets. We show three continuous domains where the resulting algorithm, MetaGFN, accelerates convergence to the target distribution and discovers more distant reward modes than previous off-policy exploration strategies used for GFlowNets.
comment: 10 pages
☆ Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on "upcycling" dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and a 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.
☆ Airfoil Diffusion: Denoising Diffusion Model For Conditional Airfoil Generation
The design of aerodynamic shapes, such as airfoils, has traditionally required significant computational resources and relied on predefined design parameters, which limit the potential for novel shape synthesis. In this work, we introduce a data-driven methodology for airfoil generation using a diffusion model. Trained on a dataset of preexisting airfoils, our model can generate an arbitrary number of new airfoils from random vectors, which can be conditioned on specific aerodynamic performance metrics such as lift and drag, or geometric criteria. Our results demonstrate that the diffusion model effectively produces airfoil shapes with realistic aerodynamic properties, offering substantial improvements in efficiency, flexibility, and the potential for discovering innovative airfoil designs. This approach significantly expands the design space, facilitating the synthesis of high-performance aerodynamic shapes that transcend the limitations of traditional methods.
comment: 12 Pages, 6 figures
☆ A New Method for Cross-Lingual-based Semantic Role Labeling
Semantic role labeling is a crucial task in natural language processing, enabling better comprehension of natural language. However, the lack of annotated data in multiple languages has posed a challenge for researchers. To address this, a deep learning algorithm based on model transfer has been proposed. The algorithm utilizes a dataset consisting of the English portion of CoNLL2009 and a corpus of semantic roles in Persian. To optimize the efficiency of training, only ten percent of the educational data from each language is used. The results of the proposed model demonstrate significant improvements compared to Niksirt et al.'s model. In monolingual mode, the proposed model achieved a 2.05 percent improvement on F1-score, while in cross-lingual mode, the improvement was even more substantial, reaching 6.23 percent. Worth noting is that the compared model only trained two of the four stages of semantic role labeling and employed golden data for the remaining two stages. This suggests that the actual superiority of the proposed model surpasses the reported numbers by a significant margin. The development of cross-lingual methods for semantic role labeling holds promise, particularly in addressing the scarcity of annotated data for various languages. These advancements pave the way for further research in understanding and processing natural language across different linguistic contexts.
☆ Bias in LLMs as Annotators: The Effect of Party Cues on Labelling Decision by Large Language Models
Human coders are biased. We test similar biases in Large Language Models (LLMs) as annotators. By replicating an experiment run by Ennser-Jedenastik and Meyer (2018), we find evidence that LLMs use political information, and specifically party cues, to judge political statements. Not only do LLMs use relevant information to contextualize whether a statement is positive, negative, or neutral based on the party cue, they also reflect the biases of the human-generated data upon which they have been trained. We also find that unlike humans, who are only biased when faced with statements from extreme parties, LLMs exhibit significant bias even when prompted with statements from center-left and center-right parties. The implications of our findings are discussed in the conclusion.
☆ The Role of Fibration Symmetries in Geometric Deep Learning
Geometric Deep Learning (GDL) unifies a broad class of machine learning techniques from the perspectives of symmetries, offering a framework for introducing problem-specific inductive biases like Graph Neural Networks (GNNs). However, the current formulation of GDL is limited to global symmetries that are not often found in real-world problems. We propose to relax GDL to allow for local symmetries, specifically fibration symmetries in graphs, to leverage regularities of realistic instances. We show that GNNs apply the inductive bias of fibration symmetries and derive a tighter upper bound for their expressive power. Additionally, by identifying symmetries in networks, we collapse network nodes, thereby increasing their computational efficiency during both inference and training of deep neural networks. The mathematical extension introduced here applies beyond graphs to manifolds, bundles, and grids for the development of models with inductive biases induced by local symmetries that can lead to better generalization.
☆ Robust Statistical Scaling of Outlier Scores: Improving the Quality of Outlier Probabilities for Outliers (Extended Version)
Outlier detection algorithms typically assign an outlier score to each observation in a dataset, indicating the degree to which an observation is an outlier. However, these scores are often not comparable across algorithms and can be difficult for humans to interpret. Statistical scaling addresses this problem by transforming outlier scores into outlier probabilities without using ground-truth labels, thereby improving interpretability and comparability across algorithms. However, the quality of this transformation can be different for outliers and inliers. Missing outliers in scenarios where they are of particular interest - such as healthcare, finance, or engineering - can be costly or dangerous. Thus, ensuring good probabilities for outliers is essential. This paper argues that statistical scaling, as commonly used in the literature, does not produce equally good probabilities for outliers as for inliers. Therefore, we propose robust statistical scaling, which uses robust estimators to improve the probabilities for outliers. We evaluate several variants of our method against other outlier score transformations for real-world datasets and outlier detection algorithms, where it can improve the probabilities for outliers.
comment: 15 pages, 4 figures, accepted for publication in SISAP 2024
☆ Retrieval-Augmented Instruction Tuning for Automated Process Engineering Calculations : A Tool-Chaining Problem-Solving Framework with Attributable Reflection ECML
The current technology landscape lacks a foundational AI model for solving process engineering calculations. In this work, we introduce a novel autonomous agent framework leveraging Retrieval-Augmented Instruction-Tuning (RAIT) to enhance open, customizable small code language models (SLMs) for these calculations. By combining instruction tuned code SLMs with Retrieval-Augmented Code Generation (RACG) using external tools, the agent generates, debugs, and optimizes code from natural language specifications. Our approach addresses the limitations of the current lack of a foundational AI model for specialized process engineering tasks and offers benefits of explainability, knowledge editing, and cost-effectiveness. Additionally, we curate custom datasets of chemical and process engineering problems and solutions to overcome data scarcity. Experimental results show that our framework matches the performance of large-scale proprietary models on benchmark datasets, proving its effectiveness and usability.
comment: Accepted for publication at ML4CCE workshop at ECML PKDD 2024. Please find the link: https://ml4cce-ecml.com/#agenda
☆ microYOLO: Towards Single-Shot Object Detection on Microcontrollers ECML
This work-in-progress paper presents results on the feasibility of single-shot object detection on microcontrollers using YOLO. Single-shot object detectors like YOLO are widely used, however due to their complexity mainly on larger GPU-based platforms. We present microYOLO, which can be used on Cortex-M based microcontrollers, such as the OpenMV H7 R2, achieving about 3.5 FPS when classifying 128x128 RGB images while using less than 800 KB Flash and less than 350 KB RAM. Furthermore, we share experimental results for three different object detection tasks, analyzing the accuracy of microYOLO on them.
comment: Published at the ECML PKDD Conference 2023, at the 4th Workshop on IoT, Edge, and Mobile for Embedded Machine Learning
☆ Fusing Pruned and Backdoored Models: Optimal Transport-based Data-free Backdoor Mitigation
Backdoor attacks present a serious security threat to deep neuron networks (DNNs). Although numerous effective defense techniques have been proposed in recent years, they inevitably rely on the availability of either clean or poisoned data. In contrast, data-free defense techniques have evolved slowly and still lag significantly in performance. To address this issue, different from the traditional approach of pruning followed by fine-tuning, we propose a novel data-free defense method named Optimal Transport-based Backdoor Repairing (OTBR) in this work. This method, based on our findings on neuron weight changes (NWCs) of random unlearning, uses optimal transport (OT)-based model fusion to combine the advantages of both pruned and backdoored models. Specifically, we first demonstrate our findings that the NWCs of random unlearning are positively correlated with those of poison unlearning. Based on this observation, we propose a random-unlearning NWC pruning technique to eliminate the backdoor effect and obtain a backdoor-free pruned model. Then, motivated by the OT-based model fusion, we propose the pruned-to-backdoored OT-based fusion technique, which fuses pruned and backdoored models to combine the advantages of both, resulting in a model that demonstrates high clean accuracy and a low attack success rate. To our knowledge, this is the first work to apply OT and model fusion techniques to backdoor defense. Extensive experiments show that our method successfully defends against all seven backdoor attacks across three benchmark datasets, outperforming both state-of-the-art (SOTA) data-free and data-dependent methods. The code implementation and Appendix are provided in the Supplementary Material.
☆ chemtrain: Learning Deep Potential Models via Automatic Differentiation and Statistical Physics
Neural Networks (NNs) are promising models for refining the accuracy of molecular dynamics, potentially opening up new fields of application. Typically trained bottom-up, atomistic NN potential models can reach first-principle accuracy, while coarse-grained implicit solvent NN potentials surpass classical continuum solvent models. However, overcoming the limitations of costly generation of accurate reference data and data inefficiency of common bottom-up training demands efficient incorporation of data from many sources. This paper introduces the framework chemtrain to learn sophisticated NN potential models through customizable training routines and advanced training algorithms. These routines can combine multiple top-down and bottom-up algorithms, e.g., to incorporate both experimental and simulation data or pre-train potentials with less costly algorithms. chemtrain provides an object-oriented high-level interface to simplify the creation of custom routines. On the lower level, chemtrain relies on JAX to compute gradients and scale the computations to use available resources. We demonstrate the simplicity and importance of combining multiple algorithms in the examples of parametrizing an all-atomistic model of titanium and a coarse-grained implicit solvent model of alanine dipeptide.
comment: Package source code published at http://github.com/tummfm/chemtrain
☆ Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification
As the field of artificial intelligence progresses, assistive technologies are becoming more widely used across all industries. The healthcare industry is no different, with numerous studies being done to develop assistive tools for healthcare professionals. Automatic diagnostic systems are one such beneficial tool that can assist with a variety of tasks, including collecting patient information, analyzing test results, and diagnosing patients. However, the idea of developing systems that can provide a differential diagnosis has been largely overlooked in most of these research studies. In this study, we propose a transformer-based approach for providing differential diagnoses based on a patient's age, sex, medical history, and symptoms. We use the DDXPlus dataset, which provides differential diagnosis information for patients based on 49 disease types. Firstly, we propose a method to process the tabular patient data from the dataset and engineer them into patient reports to make them suitable for our research. In addition, we introduce two data modification modules to diversify the training data and consequently improve the robustness of the models. We approach the task as a multi-label classification problem and conduct extensive experiments using four transformer models. All the models displayed promising results by achieving over 97% F1 score on the held-out test set. Moreover, we design additional behavioral tests to get a broader understanding of the models. In particular, for one of our test cases, we prepared a custom test set of 100 samples with the assistance of a doctor. The results on the custom set showed that our proposed data modification modules improved the model's generalization capabilities. We hope our findings will provide future researchers with valuable insights and inspire them to develop reliable systems for automatic differential diagnosis.
comment: 25 pages, 7 figures
☆ Automated Mixture Analysis via Structural Evaluation
The determination of chemical mixture components is vital to a multitude of scientific fields. Oftentimes spectroscopic methods are employed to decipher the composition of these mixtures. However, the sheer density of spectral features present in spectroscopic databases can make unambiguous assignment to individual species challenging. Yet, components of a mixture are commonly chemically related due to environmental processes or shared precursor molecules. Therefore, analysis of the chemical relevance of a molecule is important when determining which species are present in a mixture. In this paper, we combine machine-learning molecular embedding methods with a graph-based ranking system to determine the likelihood of a molecule being present in a mixture based on the other known species and/or chemical priors. By incorporating this metric in a rotational spectroscopy mixture analysis algorithm, we demonstrate that the mixture components can be identified with extremely high accuracy (>97%) in an efficient manner.
comment: Accepted for publication in The Journal of Physical Chemistry A
☆ Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough ICML 2024
We investigate continued pretraining of LLMs for language adaptation on a tight academic budget: a setting in which only a few GPUs can be used in parallel, for a heavily constrained duration. We focus on adapting Mistral-7B to German or Arabic and evaluate several techniques to improve efficiency and effectiveness in this setting. Our German models adapted on this tight compute budget underperform compared to the base Mistral-7B, while our Arabic models outperform several baselines, showing that for sufficiently well-represented languages, continued pretraining for specialization is not always helpful. Our main findings focus on training precision and tokenizer swapping. Our results show that pure bfloat16 training is a viable alternative to mixed-precision training, while being much faster when only using a few GPUs. Swapping the tokenizer for a specialized one yields more efficient tokenization and is competitive with the original tokenizer, which already contains some German tokens, but did not significantly increase performance for German. Code and model weights are available at on GitHub.
comment: WANT@ICML 2024
☆ Efficient LLM Scheduling by Learning to Rank
In Large Language Model (LLM) inference, the output length of an LLM request is typically regarded as not known a priori. Consequently, most LLM serving systems employ a simple First-come-first-serve (FCFS) scheduling strategy, leading to Head-Of-Line (HOL) blocking and reduced throughput and service quality. In this paper, we reexamine this assumption -- we show that, although predicting the exact generation length of each request is infeasible, it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. The ranking information offers valuable guidance for scheduling requests. Building on this insight, we develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches. We integrate this scheduler with the state-of-the-art LLM serving system and show significant performance improvement in several important applications: 2.8x lower latency in chatbot serving and 6.5x higher throughput in synthetic data generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git
☆ Implicit Regularization Paths of Weighted Neural Representations
We study the implicit regularization effects induced by (observation) weighting of pretrained features. For weight and feature matrices of bounded operator norms that are infinitesimally free with respect to (normalized) trace functionals, we derive equivalence paths connecting different weighting matrices and ridge regularization levels. Specifically, we show that ridge estimators trained on weighted features along the same path are asymptotically equivalent when evaluated against test vectors of bounded norms. These paths can be interpreted as matching the effective degrees of freedom of ridge estimators fitted with weighted features. For the special case of subsampling without replacement, our results apply to independently sampled random features and kernel features and confirm recent conjectures (Conjectures 7 and 8) of the authors on the existence of such paths in Patil et al. We also present an additive risk decomposition for ensembles of weighted estimators and show that the risks are equivalent along the paths when the ensemble size goes to infinity. As a practical consequence of the path equivalences, we develop an efficient cross-validation method for tuning and apply it to subsampled pretrained representations across several models (e.g., ResNet-50) and datasets (e.g., CIFAR-100).
comment: 19 pages for main and 19 pages for appendix
☆ wav2pos: Sound Source Localization using Masked Autoencoders
We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source, by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of audio recordings and microphone coordinates are missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning based localization methods.
comment: IPIN 2024
☆ Harmonized Speculative Sampling
Speculative sampling has proven to be an effective solution to accelerate decoding from large language models, where the acceptance rate significantly determines the performance. Most previous works on improving the acceptance rate focus on aligned training and efficient decoding, implicitly paying less attention to the linkage of training and decoding. In this work, we first investigate the linkage of training and decoding for speculative sampling and then propose a solution named HArmonized Speculative Sampling (HASS). HASS improves the acceptance rate without extra inference overhead by harmonizing training and decoding on their objectives and contexts. Experiments on three LLaMA models demonstrate that HASS achieves 2.81x-3.65x wall-clock time speedup ratio averaging across three datasets, which is 8%-15% faster than EAGLE-2.
☆ A Neural Material Point Method for Particle-based Simulations
Mesh-free Lagrangian methods are widely used for simulating fluids, solids, and their complex interactions due to their ability to handle large deformations and topological changes. These physics simulators, however, require substantial computational resources for accurate simulations. To address these issues, deep learning emulators promise faster and scalable simulations, yet they often remain expensive and difficult to train, limiting their practical use. Inspired by the Material Point Method (MPM), we present NeuralMPM, a neural emulation framework for particle-based simulations. NeuralMPM interpolates Lagrangian particles onto a fixed-size grid, computes updates on grid nodes using image-to-image neural networks, and interpolates back to the particles. Similarly to MPM, NeuralMPM benefits from the regular voxelized representation to simplify the computation of the state dynamics, while avoiding the drawbacks of mesh-based Eulerian methods. We demonstrate the advantages of NeuralMPM on several datasets, including fluid dynamics and fluid-solid interactions. Compared to existing methods, NeuralMPM reduces training times from days to hours, while achieving comparable or superior long-term accuracy, making it a promising approach for practical forward and inverse problems. A project page is available at https://neuralmpm.isach.be
☆ Advanced POD-Based Performance Evaluation of Classifiers Applied to Human Driver Lane Changing Prediction
Machine learning (ML) classifiers serve as essential tools facilitating classification and prediction across various domains. The performance of these algorithms should be known to ensure their reliable application. In certain fields, receiver operating characteristic and precision-recall curves are frequently employed to assess machine learning algorithms without accounting for the impact of process parameters. However, it may be essential to evaluate the performance of these algorithms in relation to such parameters. As a performance evaluation metric capable of considering the effects of process parameters, this paper uses a modified probability of detection (POD) approach to assess the reliability of ML-based algorithms. As an example, the POD-based approach is employed to assess ML models used for predicting the lane changing behavior of a vehicle driver. The time remaining to the predicted (and therefore unknown) lane changing event is considered as process parameter. The hit/miss approach to POD is taken here and modified by considering the probability of lane changing derived from ML algorithms at each time step, and obtaining the final result of the analysis accordingly. This improves the reliability of results compared to the standard hit/miss approach, which considers the outcome of the classifiers as either 0 or 1, while also simplifying evaluation compared to the \^a versus a approach. Performance evaluation results of the proposed approach are compared with those obtained with the standard hit/miss approach and a pre-developed \^a versus a approach to validate the effectiveness of the proposed method. The comparison shows that this method provides an averaging conservative behavior with the advantage of enhancing the reliability of the hit/miss approach to POD while retaining its simplicity.
comment: Manuscript: 8 pages, 6 figures, 4 tables
☆ Autoregressive model path dependence near Ising criticality
Autoregressive models are a class of generative model that probabilistically predict the next output of a sequence based on previous inputs. The autoregressive sequence is by definition one-dimensional (1D), which is natural for language tasks and hence an important component of modern architectures like recurrent neural networks (RNNs) and transformers. However, when language models are used to predict outputs on physical systems that are not intrinsically 1D, the question arises of which choice of autoregressive sequence -- if any -- is optimal. In this paper, we study the reconstruction of critical correlations in the two-dimensional (2D) Ising model, using RNNs and transformers trained on binary spin data obtained near the thermal phase transition. We compare the training performance for a number of different 1D autoregressive sequences imposed on finite-size 2D lattices. We find that paths with long 1D segments are more efficient at training the autoregressive models compared to space-filling curves that better preserve the 2D locality. Our results illustrate the potential importance in choosing the optimal autoregressive sequence ordering when training modern language models for tasks in physics.
comment: 12 pages, 4 figures
☆ Pixels to Prose: Understanding the art of Image Captioning
In the era of evolving artificial intelligence, machines are increasingly emulating human-like capabilities, including visual perception and linguistic expression. Image captioning stands at the intersection of these domains, enabling machines to interpret visual content and generate descriptive text. This paper provides a thorough review of image captioning techniques, catering to individuals entering the field of machine learning who seek a comprehensive understanding of available options, from foundational methods to state-of-the-art approaches. Beginning with an exploration of primitive architectures, the review traces the evolution of image captioning models to the latest cutting-edge solutions. By dissecting the components of these architectures, readers gain insights into the underlying mechanisms and can select suitable approaches tailored to specific problem requirements without duplicating efforts. The paper also delves into the application of image captioning in the medical domain, illuminating its significance in various real-world scenarios. Furthermore, the review offers guidance on evaluating the performance of image captioning systems, highlighting key metrics for assessment. By synthesizing theoretical concepts with practical application, this paper equips readers with the knowledge needed to navigate the complex landscape of image captioning and harness its potential for diverse applications in machine learning and beyond.
☆ Evaluating Model Robustness Using Adaptive Sparse L0 Regularization
Deep Neural Networks have demonstrated remarkable success in various domains but remain susceptible to adversarial examples, which are slightly altered inputs designed to induce misclassification. While adversarial attacks typically optimize under Lp norm constraints, attacks based on the L0 norm, prioritising input sparsity, are less studied due to their complex and non convex nature. These sparse adversarial examples challenge existing defenses by altering a minimal subset of features, potentially uncovering more subtle DNN weaknesses. However, the current L0 norm attack methodologies face a trade off between accuracy and efficiency either precise but computationally intense or expedient but imprecise. This paper proposes a novel, scalable, and effective approach to generate adversarial examples based on the L0 norm, aimed at refining the robustness evaluation of DNNs against such perturbations.
comment: Accepted by the 20th International Conference on Advanced Data Mining and Applications (ADMA 2024)
☆ Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers
Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data on scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach outperforms prior arts consistently on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.
☆ Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we propose Loss-Free Balancing, featured by an auxiliary-loss-free load balancing strategy. To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training. We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.
☆ GANs Conditioning Methods: A Survey
In recent years, Generative Adversarial Networks (GANs) have seen significant advancements, leading to their widespread adoption across various fields. The original GAN architecture enables the generation of images without any specific control over the content, making it an unconditional generation process. However, many practical applications require precise control over the generated output, which has led to the development of conditional GANs (cGANs) that incorporate explicit conditioning to guide the generation process. cGANs extend the original framework by incorporating additional information (conditions), enabling the generation of samples that adhere to that specific criteria. Various conditioning methods have been proposed, each differing in how they integrate the conditioning information into both the generator and the discriminator networks. In this work, we review the conditioning methods proposed for GANs, exploring the characteristics of each method and highlighting their unique mechanisms and theoretical foundations. Furthermore, we conduct a comparative analysis of these methods, evaluating their performance on various image datasets. Through these analyses, we aim to provide insights into the strengths and limitations of various conditioning techniques, guiding future research and application in generative modeling.
☆ Comparison of Model Predictive Control and Proximal Policy Optimization for a 1-DOF Helicopter System
This study conducts a comparative analysis of Model Predictive Control (MPC) and Proximal Policy Optimization (PPO), a Deep Reinforcement Learning (DRL) algorithm, applied to a 1-Degree of Freedom (DOF) Quanser Aero 2 system. Classical control techniques such as MPC and Linear Quadratic Regulator (LQR) are widely used due to their theoretical foundation and practical effectiveness. However, with advancements in computational techniques and machine learning, DRL approaches like PPO have gained traction in solving optimal control problems through environment interaction. This paper systematically evaluates the dynamic response characteristics of PPO and MPC, comparing their performance, computational resource consumption, and implementation complexity. Experimental results show that while LQR achieves the best steady-state accuracy, PPO excels in rise-time and adaptability, making it a promising approach for applications requiring rapid response and adaptability. Additionally, we have established a baseline for future RL-related research on this specific testbed. We also discuss the strengths and limitations of each control strategy, providing recommendations for selecting appropriate controllers for real-world scenarios.
comment: Accepted at INDIN2024
☆ Convergent Differential Privacy Analysis for General Federated Learning: the f-DP Perspective
Federated learning (FL) is an efficient collaborative training paradigm extensively developed with a focus on local privacy protection, and differential privacy (DP) is a classical approach to capture and ensure the reliability of local privacy. The powerful cooperation of FL and DP provides a promising learning framework for large-scale private clients, juggling both privacy securing and trustworthy learning. As the predominant algorithm of DP, the noisy perturbation has been widely studied and incorporated into various federated algorithms, theoretically proven to offer significant privacy protections. However, existing analyses in noisy FL-DP mostly rely on the composition theorem and cannot tightly quantify the privacy leakage challenges, which is nearly tight for small numbers of communication rounds but yields an arbitrarily loose and divergent bound under the large communication rounds. This implies a counterintuitive judgment, suggesting that FL may not provide adequate privacy protection during long-term training. To further investigate the convergent privacy and reliability of the FL-DP framework, in this paper, we comprehensively evaluate the worst privacy of two classical methods under the non-convex and smooth objectives based on the f-DP analysis, i.e. Noisy-FedAvg and Noisy-FedProx methods. With the aid of the shifted-interpolation technique, we successfully prove that the worst privacy of the Noisy-FedAvg method achieves a tight convergent lower bound. Moreover, in the Noisy-FedProx method, with the regularization of the proxy term, the worst privacy has a stable constant lower bound. Our analysis further provides a solid theoretical foundation for the reliability of privacy protection in FL-DP. Meanwhile, our conclusions can also be losslessly converted to other classical DP analytical frameworks, e.g. $(\epsilon,\delta)$-DP and R$\acute{\text{e}}$nyi-DP (RDP).
☆ CAPER: Enhancing Career Trajectory Prediction using Temporal Knowledge Graph and Ternary Relationship
The problem of career trajectory prediction (CTP) aims to predict one's future employer or job position. While several CTP methods have been developed for this problem, we posit that none of these methods (1) jointly considers the mutual ternary dependency between three key units (i.e., user, position, and company) of a career and (2) captures the characteristic shifts of key units in career over time, leading to an inaccurate understanding of the job movement patterns in the labor market. To address the above challenges, we propose a novel solution, named as CAPER, that solves the challenges via sophisticated temporal knowledge graph (TKG) modeling. It enables the utilization of a graph-structured knowledge base with rich expressiveness, effectively preserving the changes in job movement patterns. Furthermore, we devise an extrapolated career reasoning task on TKG for a realistic evaluation. The experiments on a real-world career trajectory dataset demonstrate that CAPER consistently and significantly outperforms four baselines, two recent TKG reasoning methods, and five state-of-the-art CTP methods in predicting one's future companies and positions-i.e., on average, yielding 6.80% and 34.58% more accurate predictions, respectively.
☆ Large-Scale Demand Prediction in Urban Rail using Multi-Graph Inductive Representation Learning
With the expansion of cities over time, URT (Urban Rail Transit) networks have also grown significantly. Demand prediction plays an important role in supporting planning, scheduling, fleet management, and other operational decisions. In this study, we propose an Origin-Destination (OD) demand prediction model called Multi-Graph Inductive Representation Learning (mGraphSAGE) for large-scale URT networks under operational uncertainties. Our main contributions are twofold: we enhance prediction results while ensuring scalability for large networks by relying simultaneously on multiple graphs, where each OD pair is a node on a graph and distinct OD relationships, such as temporal and spatial correlations; we show the importance of including operational uncertainties such as train delays and cancellations as inputs in demand prediction for daily operations. The model is validated on three different scales of the URT network in Copenhagen, Denmark. Experimental results show that by leveraging information from neighboring ODs and learning node representations via sampling and aggregation, mGraphSAGE is particularly suitable for OD demand prediction in large-scale URT networks, outperforming reference machine learning methods. Furthermore, during periods with train cancellations and delays, the performance gap between mGraphSAGE and other methods improves compared to normal operating conditions, demonstrating its ability to leverage system reliability information for predicting OD demand under uncertainty.
comment: 18 pages, 3 figures
☆ Statistical QoS Provision in Business-Centric Networks
More refined resource management and Quality of Service (QoS) provisioning is a critical goal of wireless communication technologies. In this paper, we propose a novel Business-Centric Network (BCN) aimed at enabling scalable QoS provisioning, based on a cross-layer framework that captures the relationship between application, transport parameters, and channels. We investigate both continuous flow and event-driven flow models, presenting key QoS metrics such as throughput, delay, and reliability. By jointly considering power and bandwidth allocation, transmission parameters, and AP network topology across layers, we optimize weighted resource efficiency with statistical QoS provisioning. To address the coupling among parameters, we propose a novel deep reinforcement learning (DRL) framework, which is Collaborative Optimization among Heterogeneous Actors with Experience Sharing (COHA-ES). Power and sub-channel (SC) Actors representing multiple APs are jointly optimized under the unified guidance of a common critic. Additionally, we introduce a novel multithreaded experience-sharing mechanism to accelerate training and enhance rewards. Extensive comparative experiments validate the effectiveness of our DRL framework in terms of convergence and efficiency. Moreover, comparative analyses demonstrate the comprehensive advantages of the BCN structure in enhancing both spectral and energy efficiency.
comment: 13 pages
☆ Grand canonical generative diffusion model for crystalline phases and grain boundaries
The diffusion model has emerged as a powerful tool for generating atomic structures for materials science. This work calls attention to the deficiency of current particle-based diffusion models, which represent atoms as a point cloud, in generating even the simplest ordered crystalline structures. The problem is attributed to particles being trapped in local minima during the score-driven simulated annealing of the diffusion process, similar to the physical process of force-driven simulated annealing. We develop a solution, the grand canonical diffusion model, which adopts an alternative voxel-based representation with continuous rather than fixed number of particles. The method is applied towards generation of several common crystalline phases as well as the technologically important and challenging problem of grain boundary structures.
☆ Exploring Selective Layer Fine-Tuning in Federated Learning
Federated learning (FL) has emerged as a promising paradigm for fine-tuning foundation models using distributed data in a privacy-preserving manner. Under limited computational resources, clients often find it more practical to fine-tune a selected subset of layers, rather than the entire model, based on their task-specific data. In this study, we provide a thorough theoretical exploration of selective layer fine-tuning in FL, emphasizing a flexible approach that allows the clients to adjust their selected layers according to their local data and resources. We theoretically demonstrate that the layer selection strategy has a significant impact on model convergence in two critical aspects: the importance of selected layers and the heterogeneous choices across clients. Drawing from these insights, we further propose a strategic layer selection method that utilizes local gradients and regulates layer selections across clients. The extensive experiments on both image and text datasets demonstrate the effectiveness of the proposed strategy compared with several baselines, highlighting its advances in identifying critical layers that adapt to the client heterogeneity and training dynamics in FL.
☆ Skills Regularized Task Decomposition for Multi-task Offline Reinforcement Learning NeurIPS 2022
Reinforcement learning (RL) with diverse offline datasets can have the advantage of leveraging the relation of multiple tasks and the common skills learned across those tasks, hence allowing us to deal with real-world complex problems efficiently in a data-driven way. In offline RL where only offline data is used and online interaction with the environment is restricted, it is yet difficult to achieve the optimal policy for multiple tasks, especially when the data quality varies for the tasks. In this paper, we present a skill-based multi-task RL technique on heterogeneous datasets that are generated by behavior policies of different quality. To learn the shareable knowledge across those datasets effectively, we employ a task decomposition method for which common skills are jointly learned and used as guidance to reformulate a task in shared and achievable subtasks. In this joint learning, we use Wasserstein auto-encoder (WAE) to represent both skills and tasks on the same latent space and use the quality-weighted loss as a regularization term to induce tasks to be decomposed into subtasks that are more consistent with high-quality skills than others. To improve the performance of offline RL agents learned on the latent space, we also augment datasets with imaginary trajectories relevant to high-quality skills for each task. Through experiments, we show that our multi-task offline RL approach is robust to the mixed configurations of different-quality datasets and it outperforms other state-of-the-art algorithms for several robotic manipulation tasks and drone navigation tasks.
comment: 12 pages, 5 figures, acceepted in NeurIPS 2022
☆ VFLIP: A Backdoor Defense for Vertical Federated Learning via Identification and Purification ESORICS 2024
Vertical Federated Learning (VFL) focuses on handling vertically partitioned data over FL participants. Recent studies have discovered a significant vulnerability in VFL to backdoor attacks which specifically target the distinct characteristics of VFL. Therefore, these attacks may neutralize existing defense mechanisms designed primarily for Horizontal Federated Learning (HFL) and deep neural networks. In this paper, we present the first backdoor defense, called VFLIP, specialized for VFL. VFLIP employs the identification and purification techniques that operate at the inference stage, consequently improving the robustness against backdoor attacks to a great extent. VFLIP first identifies backdoor-triggered embeddings by adopting a participant-wise anomaly detection approach. Subsequently, VFLIP conducts purification which removes the embeddings identified as malicious and reconstructs all the embeddings based on the remaining embeddings. We conduct extensive experiments on CIFAR10, CINIC10, Imagenette, NUS-WIDE, and BankMarketing to demonstrate that VFLIP can effectively mitigate backdoor attacks in VFL. https://github.com/blingcho/VFLIP-esorics24
comment: Accepted by 29th European Symposium on Research in Computer Security (ESORICS 2024)
☆ Bayesian optimization of atomic structures with prior probabilities from universal interatomic potentials
The optimization of atomic structures plays a pivotal role in understanding and designing materials with desired properties. However, conventional methods often struggle with the formidable task of navigating the vast potential energy surface, especially in high-dimensional spaces with numerous local minima. Recent advancements in machine learning-driven surrogate models offer a promising avenue for alleviating this computational burden. In this study, we propose a novel approach that combines the strengths of universal machine learning potentials with a Bayesian approach of the GOFEE/BEACON framework. By leveraging the comprehensive chemical knowledge encoded in pretrained universal machine learning potentials as a prior estimate of energy and forces, we enable the Gaussian process to focus solely on capturing the intricate nuances of the potential energy surface. We demonstrate the efficacy of our approach through comparative analyses across diverse systems, including periodic bulk materials, surface structures, and a cluster.
☆ Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation AAAI 2025
Lossless speculative decoding accelerates target large language model (LLM) inference by employing a lightweight draft model for generating tree-structured candidates, which are subsequently verified in parallel by the target LLM. Currently, effective approaches leverage feature-level rather than token-level autoregression within the draft model to facilitate more straightforward predictions and enhanced knowledge distillation. In this paper, we reassess these approaches and propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding), which introduces two straightforward and effective components within the existing framework to boost lossless speculative decoding. Firstly, FSPAD utilizes token embeddings to sample features of the target LLM in high-dimensional space before feeding them into the draft model, due to the inherent uncertainty of the features preventing the draft model from obtaining the specific token output by the target LLM. Secondly, FSPAD introduces partial alignment distillation to weaken the draft model's connection between features and logits, aiming to reduce the conflict between feature alignment and logit confidence during training. Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series, as well as tasks in multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation. The results show that FSPAD outperforms the state-of-the-art method across all the aforementioned tasks and target LLMs.
comment: The work was not submitted to AAAI 2025
☆ Latent Relationship Mining of Glaucoma Biomarkers: a TRI-LSTM based Deep Learning
In recently years, a significant amount of research has been conducted on applying deep learning methods for glaucoma classification and detection. However, the explainability of those established machine learning models remains a big concern. In this research, in contrast, we learn from cognitive science concept and study how ophthalmologists judge glaucoma detection. Simulating experts' efforts, we propose a hierarchical decision making system, centered around a holistic set of carefully designed biomarker-oriented machine learning models. While biomarkers represent the key indicators of how ophthalmologists identify glaucoma, they usually exhibit latent inter-relations. We thus construct a time series model, named TRI-LSTM, capable of calculating and uncovering potential and latent relationships among various biomarkers of glaucoma. Our model is among the first efforts to explore the intrinsic connections among glaucoma biomarkers. We monitor temporal relationships in patients' disease states over time and to capture and retain the progression of disease-relevant clinical information from prior visits, thereby enriching biomarker's potential relationships. Extensive experiments over real-world dataset have demonstrated the effectiveness of the proposed model.
comment: 9 pages, 4 images
☆ A Novel Denoising Technique and Deep Learning Based Hybrid Wind Speed Forecasting Model for Variable Terrain Conditions
Wind flow can be highly unpredictable and can suffer substantial fluctuations in speed and direction due to the shape and height of hills, mountains, and valleys, making accurate wind speed (WS) forecasting essential in complex terrain. This paper presents a novel and adaptive model for short-term forecasting of WS. The paper's key contributions are as follows: (a) The Partial Auto Correlation Function (PACF) is utilised to minimise the dimension of the set of Intrinsic Mode Functions (IMF), hence reducing training time; (b) The sample entropy (SampEn) was used to calculate the complexity of the reduced set of IMFs. The proposed technique is adaptive since a specific Deep Learning (DL) model-feature combination was chosen based on complexity; (c) A novel bidirectional feature-LSTM framework for complicated IMFs has been suggested, resulting in improved forecasting accuracy; (d) The proposed model shows superior forecasting performance compared to the persistence, hybrid, Ensemble empirical mode decomposition (EEMD), and Variational Mode Decomposition (VMD)-based deep learning models. It has achieved the lowest variance in terms of forecasting accuracy between simple and complex terrain conditions 0.70%. Dimension reduction of IMF's and complexity-based model-feature selection helps reduce the training time by 68.77% and improve forecasting quality by 58.58% on average.
☆ SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks.cIn this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
☆ Improving Thompson Sampling via Information Relaxation for Budgeted Multi-armed Bandits
We consider a Bayesian budgeted multi-armed bandit problem, in which each arm consumes a different amount of resources when selected and there is a budget constraint on the total amount of resources that can be used. Budgeted Thompson Sampling (BTS) offers a very effective heuristic to this problem, but its arm-selection rule does not take into account the remaining budget information. We adopt \textit{Information Relaxation Sampling} framework that generalizes Thompson Sampling for classical $K$-armed bandit problems, and propose a series of algorithms that are randomized like BTS but more carefully optimize their decisions with respect to the budget constraint. In a one-to-one correspondence with these algorithms, a series of performance benchmarks that improve the conventional benchmark are also suggested. Our theoretical analysis and simulation results show that our algorithms (and our benchmarks) make incremental improvements over BTS (respectively, the conventional benchmark) across various settings including a real-world example.
comment: accepted
☆ Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions
Causal probing is an approach to interpreting foundation models, such as large language models, by training probes to recognize latent properties of interest from embeddings, intervening on probes to modify this representation, and analyzing the resulting changes in the model's behavior. While some recent works have cast doubt on the theoretical basis of several leading causal probing intervention methods, it has been unclear how to systematically and empirically evaluate their effectiveness in practice. To address this problem, we propose a general empirical analysis framework to evaluate the reliability of causal probing interventions, formally defining and quantifying two key causal probing desiderata: completeness (fully transforming the representation of the target property) and selectivity (minimally impacting other properties). Our formalism allows us to make the first direct comparisons between different families of causal probing methods (e.g., linear vs. nonlinear or counterfactual vs. nullifying interventions). We conduct extensive experiments across several leading methods, finding that (1) there is an inherent tradeoff between these criteria, and no method is able to consistently satisfy both at once; and (2) across the board, nullifying interventions are always far less complete than counterfactual interventions, indicating that nullifying methods may not be an effective approach to causal probing.
☆ MODULI: Unlocking Preference Generalization via Diffusion Models for Offline Multi-Objective Reinforcement Learning
Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives, but it requires extensive online interactions. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. However, real-world offline datasets are often conservatively and narrowly distributed, failing to comprehensively cover preferences, leading to the emergence of out-of-distribution (OOD) preference areas. Existing offline MORL algorithms exhibit poor generalization to OOD preferences, resulting in policies that do not align with preferences. Leveraging the excellent expressive and generalization capabilities of diffusion models, we propose MODULI (Multi-objective Diffusion Planner with Sliding Guidance), which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences and derive action for decision-making. To achieve accurate generation, MODULI introduces two return normalization methods under diverse preferences for refining guidance. To further enhance generalization to OOD preferences, MODULI proposes a novel sliding guidance mechanism, which involves training an additional slider adapter to capture the direction of preference changes. Incorporating the slider, it transitions from in-distribution (ID) preferences to generating OOD preferences, patching, and extending the incomplete Pareto front. Extensive experiments on the D4MORL benchmark demonstrate that our algorithm outperforms state-of-the-art Offline MORL baselines, exhibiting excellent generalization to OOD preferences.
comment: 23 pages, 7 figures
☆ Deep Learning to Predict Late-Onset Breast Cancer Metastasis: the Single Hyperparameter Grid Search (SHGS) Strategy for Meta Tuning Concerning Deep Feed-forward Neural Network
While machine learning has advanced in medicine, its widespread use in clinical applications, especially in predicting breast cancer metastasis, is still limited. We have been dedicated to constructing a DFNN model to predict breast cancer metastasis n years in advance. However, the challenge lies in efficiently identifying optimal hyperparameter values through grid search, given the constraints of time and resources. Issues such as the infinite possibilities for continuous hyperparameters like l1 and l2, as well as the time-consuming and costly process, further complicate the task. To address these challenges, we developed Single Hyperparameter Grid Search (SHGS) strategy, serving as a preselection method before grid search. Our experiments with SHGS applied to DFNN models for breast cancer metastasis prediction focus on analyzing eight target hyperparameters: epochs, batch size, dropout, L1, L2, learning rate, decay, and momentum. We created three figures, each depicting the experiment results obtained from three LSM-I-10-Plus-year datasets. These figures illustrate the relationship between model performance and the target hyperparameter values. For each hyperparameter, we analyzed whether changes in this hyperparameter would affect model performance, examined if there were specific patterns, and explored how to choose values for the particular hyperparameter. Our experimental findings reveal that the optimal value of a hyperparameter is not only dependent on the dataset but is also significantly influenced by the settings of other hyperparameters. Additionally, our experiments suggested some reduced range of values for a target hyperparameter, which may be helpful for low-budget grid search. This approach serves as a prior experience and foundation for subsequent use of grid search to enhance model performance.
☆ Remove Symmetries to Control Model Expressivity
When symmetry is present in the loss function, the model is likely to be trapped in a low-capacity state that is sometimes known as a "collapse." Being trapped in these low-capacity states can be a major obstacle to training across many scenarios where deep learning technology is applied. We first prove two concrete mechanisms through which symmetries lead to reduced capacities and ignored features during training. We then propose a simple and theoretically justified algorithm, syre, to remove almost all symmetry-induced low-capacity states in neural networks. The proposed method is shown to improve the training of neural networks in scenarios when this type of entrapment is especially a concern. A remarkable merit of the proposed method is that it is model-agnostic and does not require any knowledge of the symmetry.
comment: preprint
☆ CTRQNets & LQNets: Continuous Time Recurrent and Liquid Quantum Neural Networks
Neural networks have continued to gain prevalence in the modern era for their ability to model complex data through pattern recognition and behavior remodeling. However, the static construction of traditional neural networks inhibits dynamic intelligence. This makes them inflexible to temporal changes in data and unfit to capture complex dependencies. With the advent of quantum technology, there has been significant progress in creating quantum algorithms. In recent years, researchers have developed quantum neural networks that leverage the capabilities of qubits to outperform classical networks. However, their current formulation exhibits a static construction limiting the system's dynamic intelligence. To address these weaknesses, we develop a Liquid Quantum Neural Network (LQNet) and a Continuous Time Recurrent Quantum Neural Network (CTRQNet). Both models demonstrate a significant improvement in accuracy compared to existing quantum neural networks (QNNs), achieving accuracy increases as high as 40\% on CIFAR 10 through binary classification. We propose LQNets and CTRQNets might shine a light on quantum machine learning's black box.
☆ PersonalizedUS: Interpretable Breast Cancer Risk Assessment with Local Coverage Uncertainty Quantification
Correctly assessing the malignancy of breast lesions identified during ultrasound examinations is crucial for effective clinical decision-making. However, the current "golden standard" relies on manual BI-RADS scoring by clinicians, often leading to unnecessary biopsies and a significant mental health burden on patients and their families. In this paper, we introduce PersonalizedUS, an interpretable machine learning system that leverages recent advances in conformal prediction to provide precise and personalized risk estimates with local coverage guarantees and sensitivity, specificity, and predictive values above 0.9 across various threshold levels. In particular, we identify meaningful lesion subgroups where distribution-free, model-agnostic conditional coverage holds, with approximately 90% of our prediction sets containing only the ground truth in most lesion subgroups, thus explicitly characterizing for which patients the model is most suitably applied. Moreover, we make available a curated tabular dataset of 1936 biopsied breast lesions from a recent observational multicenter study and benchmark the performance of several state-of-the-art learning algorithms. We also report a successful case study of the deployed system in the same multicenter context. Concrete clinical benefits include up to a 65% reduction in requested biopsies among BI-RADS 4a and 4b lesions, with minimal to no missed cancer cases.
comment: 9 pages, 5 figure, 2 tables
☆ Certified Causal Defense with Generalizable Robustness AAAI
While machine learning models have proven effective across various scenarios, it is widely acknowledged that many models are vulnerable to adversarial attacks. Recently, there have emerged numerous efforts in adversarial defense. Among them, certified defense is well known for its theoretical guarantees against arbitrary adversarial perturbations on input within a certain range (e.g., $l_2$ ball). However, most existing works in this line struggle to generalize their certified robustness in other data domains with distribution shifts. This issue is rooted in the difficulty of eliminating the negative impact of spurious correlations on robustness in different domains. To address this problem, in this work, we propose a novel certified defense framework GLEAN, which incorporates a causal perspective into the generalization problem in certified defense. More specifically, our framework integrates a certifiable causal factor learning component to disentangle the causal relations and spurious correlations between input and label, and thereby exclude the negative effect of spurious correlations on defense. On top of that, we design a causally certified defense strategy to handle adversarial attacks on latent causal factors. In this way, our framework is not only robust against malicious noises on data in the training distribution but also can generalize its robustness across domains with distribution shifts. Extensive experiments on benchmark datasets validate the superiority of our framework in certified robustness generalization in different data domains. Code is available in the supplementary materials.
comment: Submitted to AAAI
☆ Avoiding Generative Model Writer's Block With Embedding Nudging
Generative image models, since introduction, have become a global phenomenon. From new arts becoming possible to new vectors of abuse, many new capabilities have become available. One of the challenging issues with generative models is controlling the generation process specially to prevent specific generations classes or instances . There are several reasons why one may want to control the output of generative models, ranging from privacy and safety concerns to application limitations or user preferences To address memorization and privacy challenges, there has been considerable research dedicated to filtering prompts or filtering the outputs of these models. What all these solutions have in common is that at the end of the day they stop the model from producing anything, hence limiting the usability of the model. In this paper, we propose a method for addressing this usability issue by making it possible to steer away from unwanted concepts (when detected in model's output) and still generating outputs. In particular we focus on the latent diffusion image generative models and how one can prevent them to generate particular images while generating similar images with limited overhead. We focus on mitigating issues like image memorization, demonstrating our technique's effectiveness through qualitative and quantitative evaluations. Our method successfully prevents the generation of memorized training images while maintaining comparable image quality and relevance to the unmodified model.
☆ CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases
Cardinality estimation is crucial for enabling high query performance in relational databases. Recently learned cardinality estimation models have been proposed to improve accuracy but there is no systematic benchmark or datasets which allows researchers to evaluate the progress made by new learned approaches and even systematically develop new learned approaches. In this paper, we are releasing a benchmark, containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: 1-) instance-based, 2-) zero-shot, and 3-) fine-tuned. Our results show that while we get promising results for zero-shot cardinality estimation on simple single table queries; as soon as we add joins, the accuracy drops. However, we show that with fine-tuning, we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance specific models. We are open sourcing our scripts to collect statistics, generate queries and training datasets to foster more extensive research, also from the ML community on the important problem of cardinality estimation and in particular improve on recent directions such as pre-trained cardinality estimation.
☆ Simulating realistic short tandem repeat capillary electrophoretic signal using a generative adversarial network
DNA profiles are made up from multiple series of electrophoretic signal measuring fluorescence over time. Typically, human DNA analysts 'read' DNA profiles using their experience to distinguish instrument noise, artefactual signal, and signal corresponding to DNA fragments of interest. Recent work has developed an artificial neural network, ANN, to carry out the task of classifying fluorescence types into categories in DNA profile electrophoretic signal. But the creation of the necessarily large amount of labelled training data for the ANN is time consuming and expensive, and a limiting factor in the ability to robustly train the ANN. If realistic, prelabelled, training data could be simulated then this would remove the barrier to training an ANN with high efficacy. Here we develop a generative adversarial network, GAN, modified from the pix2pix GAN to achieve this task. With 1078 DNA profiles we train the GAN and achieve the ability to simulate DNA profile information, and then use the generator from the GAN as a 'realism filter' that applies the noise and artefact elements exhibited in typical electrophoretic signal.
comment: 29 pages, 9 Figures
☆ LeMON: Learning to Learn Multi-Operator Networks
Single-operator learning involves training a deep neural network to learn a specific operator, whereas recent work in multi-operator learning uses an operator embedding structure to train a single neural network on data from multiple operators. Thus, multi-operator learning is capable of predicting a range of operators within one model. In this work, we propose pretraining and fine-tuning strategies for solving PDEs using multi-operator learning. One key aspect is that by increasing the number of families of operators used in pretraining, a PDE foundation model can be fine-tuned to downstream tasks involving new PDEs with a limited number of samples, thus outperforming single operator neural networks. Specifically, a multi-operator learning model pre-trained with data from diverse PDE families can predict unseen operators after fine-tuning with only a limited number of operators from the new family, enabling them to serve as a data-free PDE solver. We also show that the proposed training and fine-tuning method is able to predict new operators in zero-shot prediction without samples. Additionally, we introduce a PDE-agnostic meta-learning algorithm to improve the adaptability of the model to various PDEs by providing a better parameter initialization process. To address the needs of applications with limited computing resources, we explore low-rank adaptation methods that reduce computational costs while enhancing solver accuracy. Lastly, by examining the scaling law with respect to the number of operator families, we establish and highlight its potential for broad adaptation in PDE-solving tasks.
☆ Free Lunch in the Forest: Functionally-Identical Pruning of Boosted Tree Ensembles
Tree ensembles, including boosting methods, are highly effective and widely used for tabular data. However, large ensembles lack interpretability and require longer inference times. We introduce a method to prune a tree ensemble into a reduced version that is "functionally identical" to the original model. In other words, our method guarantees that the prediction function stays unchanged for any possible input. As a consequence, this pruning algorithm is lossless for any aggregated metric. We formalize the problem of functionally identical pruning on ensembles, introduce an exact optimization model, and provide a fast yet highly effective method to prune large ensembles. Our algorithm iteratively prunes considering a finite set of points, which is incrementally augmented using an adversarial model. In multiple computational experiments, we show that our approach is a "free lunch", significantly reducing the ensemble size without altering the model's behavior. Thus, we can preserve state-of-the-art performance at a fraction of the original model's size.
☆ CLPNets: Coupled Lie-Poisson Neural Networks for Multi-Part Hamiltonian Systems with Symmetries
To accurately compute data-based prediction of Hamiltonian systems, especially the long-term evolution of such systems, it is essential to utilize methods that preserve the structure of the equations over time. We consider a case that is particularly challenging for data-based methods: systems with interacting parts that do not reduce to pure momentum evolution. Such systems are essential in scientific computations. For example, any discretization of a continuum elastic rod can be viewed as interacting elements that can move and rotate in space, with each discrete element moving on the group of rotations and translations $SE(3)$. We develop a novel method of data-based computation and complete phase space learning of such systems. We follow the original framework of \emph{SympNets} (Jin et al, 2020) building the neural network from canonical phase space mappings, and transformations that preserve the Lie-Poisson structure (\emph{LPNets}) as in (Eldred et al, 2024). We derive a novel system of mappings that are built into neural networks for coupled systems. We call such networks Coupled Lie-Poisson Neural Networks, or \emph{CLPNets}. We consider increasingly complex examples for the applications of CLPNets: rotation of two rigid bodies about a common axis, the free rotation of two rigid bodies, and finally the evolution of two connected and interacting $SE(3)$ components. Our method preserves all Casimir invariants of each system to machine precision, irrespective of the quality of the training data, and preserves energy to high accuracy. Our method also shows good resistance to the curse of dimensionality, requiring only a few thousand data points for all cases studied, with the effective dimension varying from three to eighteen. Additionally, the method is highly economical in memory requirements, requiring only about 200 parameters for the most complex case considered.
comment: 52 pages, 9 figures
☆ Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? ECCV 2024
Foundation models have emerged as robust models with label efficiency in diverse domains. In medical imaging, these models contribute to the advancement of medical diagnoses due to the difficulty in obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines the bias in the Foundation model (RetFound) when it is applied to fine-tune the Brazilian Multilabel Ophthalmological Dataset (BRSET), which has a different population than the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the Foundation Model has the potential to reduce the gap between the maximum AUC and minimum AUC evaluations across gender and age groups. However, in a data-efficient generalization, the model increases the bias when the data amount decreases. These findings suggest that when deploying a Foundation Model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
comment: Preprint of paper to be presented at Fairness and Ethics Towards Transparent AI: Facing the Challenge through Model Debiasing (FAILED) during ECCV 2024
☆ Improving the Prediction of Individual Engagement in Recommendations Using Cognitive Models
For public health programs with limited resources, the ability to predict how behaviors change over time and in response to interventions is crucial for deciding when and to whom interventions should be allocated. Using data from a real-world maternal health program, we demonstrate how a cognitive model based on Instance-Based Learning (IBL) Theory can augment existing purely computational approaches. Our findings show that, compared to general time-series forecasters (e.g., LSTMs), IBL models, which reflect human decision-making processes, better predict the dynamics of individuals' states. Additionally, IBL provides estimates of the volatility in individuals' states and their sensitivity to interventions, which can improve the efficiency of training of other time series models.
♻ ☆ Embedded FPGA Developments in 130nm and 28nm CMOS for Machine Learning in Particle Detector Readout
Embedded field programmable gate array (eFPGA) technology allows the implementation of reconfigurable logic within the design of an application-specific integrated circuit (ASIC). This approach offers the low power and efficiency of an ASIC along with the ease of FPGA configuration, particularly beneficial for the use case of machine learning in the data pipeline of next-generation collider experiments. An open-source framework called "FABulous" was used to design eFPGAs using 130 nm and 28 nm CMOS technology nodes, which were subsequently fabricated and verified through testing. The capability of an eFPGA to act as a front-end readout chip was assessed using simulation of high energy particles passing through a silicon pixel sensor. A machine learning-based classifier, designed for reduction of sensor data at the source, was synthesized and configured onto the eFPGA. A successful proof-of-concept was demonstrated through reproduction of the expected algorithm result on the eFPGA with perfect accuracy. Further development of the eFPGA technology and its application to collider detector readout is discussed.
comment: 16 pages, 12 figures
♻ ☆ Flextron: Many-in-One Flexible Large Language Model
Training modern LLMs is extremely resource intensive, and customizing them for various deployment scenarios characterized by limited compute and memory resources through repeated training is impractical. In this paper, we introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment. The Flextron architecture utilizes a nested elastic structure to rapidly adapt to specific user-defined latency and accuracy targets during inference with no additional fine-tuning required. It is also input-adaptive, and can automatically route tokens through its sub-networks for improved performance and efficiency. We present a sample-efficient training method and associated routing algorithms for systematically transforming an existing trained LLM into a Flextron model. We evaluate Flextron on the GPT-3 and LLama-2 family of LLMs, and demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% tokens compared to original pretraining.
♻ ☆ Examining Pathological Bias in a Generative Adversarial Network Discriminator: A Case Study on a StyleGAN3 Model
Generative adversarial networks (GANs) generate photorealistic faces that are often indistinguishable by humans from real faces. While biases in machine learning models are often assumed to be due to biases in training data, we find pathological internal color and luminance biases in the discriminator of a pre-trained StyleGAN3-r model that are not explicable by the training data. We also find that the discriminator systematically stratifies scores by both image- and face-level qualities and that this disproportionately affects images across gender, race, and other categories. We examine axes common in research on stereotyping in social psychology.
♻ ☆ GINN-KAN: Interpretability pipelining with applications in Physics Informed Neural Networks
Neural networks are powerful function approximators, yet their ``black-box" nature often renders them opaque and difficult to interpret. While many post-hoc explanation methods exist, they typically fail to capture the underlying reasoning processes of the networks. A truly interpretable neural network would be trained similarly to conventional models using techniques such as backpropagation, but additionally provide insights into the learned input-output relationships. In this work, we introduce the concept of interpretability pipelineing, to incorporate multiple interpretability techniques to outperform each individual technique. To this end, we first evaluate several architectures that promise such interpretability, with a particular focus on two recent models selected for their potential to incorporate interpretability into standard neural network architectures while still leveraging backpropagation: the Growing Interpretable Neural Network (GINN) and Kolmogorov Arnold Networks (KAN). We analyze the limitations and strengths of each and introduce a novel interpretable neural network GINN-KAN that synthesizes the advantages of both models. When tested on the Feynman symbolic regression benchmark datasets, GINN-KAN outperforms both GINN and KAN. To highlight the capabilities and the generalizability of this approach, we position GINN-KAN as an alternative to conventional black-box networks in Physics-Informed Neural Networks (PINNs). We expect this to have far-reaching implications in the application of deep learning pipelines in the natural sciences. Our experiments with this interpretable PINN on 15 different partial differential equations demonstrate that GINN-KAN augmented PINNs outperform PINNs with black-box networks in solving differential equations and surpass the capabilities of both GINN and KAN.
♻ ☆ A Deep Learning Based Resource Allocator for Communication Systems with Dynamic User Utility Demands
Deep learning (DL) based resource allocation (RA) has recently gained significant attention due to its performance efficiency. However, most related studies assume an ideal case where the number of users and their utility demands, e.g., data rate constraints, are fixed, and the designed DL-based RA scheme exploits a policy trained only for these fixed parameters. Consequently, computationally complex policy retraining is required whenever these parameters change. In this paper, we introduce a DL-based resource allocator (ALCOR) that allows users to adjust their utility demands freely, such as based on their application layer requirements. ALCOR employs deep neural networks (DNNs) as the policy in a time-sharing problem. The underlying optimization algorithm iteratively optimizes the on-off status of users to satisfy their utility demands in expectation. The policy performs unconstrained RA (URA)--RA without considering user utility demands--among active users to maximize the sum utility (SU) at each time instant. Depending on the chosen URA scheme, ALCOR can perform RA in either a centralized or distributed scenario. Derived convergence analyses provide guarantees for ALCOR's convergence, and numerical experiments corroborate its effectiveness.
♻ ☆ Geometric Neural Network based on Phase Space for BCI-EEG decoding
Objective: The integration of Deep Learning (DL) algorithms on brain signal analysis is still in its nascent stages compared to their success in fields like Computer Vision. This is particularly true for BCI, where the brain activity is decoded to control external devices without requiring muscle control. Electroencephalography (EEG) is a widely adopted choice for designing BCI systems due to its non-invasive and cost-effective nature and excellent temporal resolution. Still, it comes at the expense of limited training data, poor signal-to-noise, and a large variability across and within-subject recordings. Finally, setting up a BCI system with many electrodes takes a long time, hindering the widespread adoption of reliable DL architectures in BCIs outside research laboratories. To improve adoption, we need to improve user comfort using, for instance, reliable algorithms that operate with few electrodes. Approach: Our research aims to develop a DL algorithm that delivers effective results with a limited number of electrodes. Taking advantage of the Augmented Covariance Method and the framework of SPDNet, we propose the Phase-SPDNet architecture and analyze its performance and the interpretability of the results. The evaluation is conducted on 5-fold cross-validation, using only three electrodes positioned above the Motor Cortex. The methodology was tested on nearly 100 subjects from several open-source datasets using the Mother Of All BCI Benchmark (MOABB) framework. Main results: The results of our Phase-SPDNet demonstrate that the augmented approach combined with the SPDNet significantly outperforms all the current state-of-the-art DL architecture in MI decoding. Significance: This new architecture is explainable and with a low number of trainable parameters.
♻ ☆ On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers
On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, making the implementation and execution of DNN training algorithms on MCUs challenging due to low processor speeds, constrained throughput, limited floating-point support, and memory constraints. In this work, we explore on-device training of DNNs for Cortex-M MCUs. We present a method that enables efficient training of DNNs completely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware.
comment: 12 pages, 9 figures
♻ ☆ Provable Probabilistic Imaging using Score-Based Generative Priors
Estimating high-quality images while also quantifying their uncertainty are two desired features in an image reconstruction algorithm for solving ill-posed inverse problems. In this paper, we propose plug-and-play Monte Carlo (PMC) as a principled framework for characterizing the space of possible solutions to a general inverse problem. PMC is able to incorporate expressive score-based generative priors for high-quality image reconstruction while also performing uncertainty quantification via posterior sampling. In particular, we develop two PMC algorithms that can be viewed as the sampling analogues of the traditional plug-and-play priors (PnP) and regularization by denoising (RED) algorithms. To improve the sampling efficiency, we introduce weighted annealing into these PMC algorithms, further developing two additional annealed PMC algorithms (APMC). We establish a theoretical analysis for characterizing the convergence behavior of PMC algorithms. Our analysis provides non-asymptotic stationarity guarantees in terms of the Fisher information, fully compatible with the joint presence of weighted annealing, potentially non-log-concave likelihoods, and imperfect score networks. We demonstrate the performance of the PMC algorithms on multiple representative inverse problems with both linear and nonlinear forward models. Experimental results show that PMC significantly improves reconstruction quality and enables high-fidelity uncertainty quantification.
♻ ☆ Correlation recurrent units: A novel neural architecture for improving the predictive performance of time-series data
The time-series forecasting (TSF) problem is a traditional problem in the field of artificial intelligence. Models such as Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), and GRU (Gate Recurrent Units) have contributed to improving the predictive accuracy of TSF. Furthermore, model structures have been proposed to combine time-series decomposition methods, such as seasonal-trend decomposition using Loess (STL) to ensure improved predictive accuracy. However, because this approach is learned in an independent model for each component, it cannot learn the relationships between time-series components. In this study, we propose a new neural architecture called a correlation recurrent unit (CRU) that can perform time series decomposition within a neural cell and learn correlations (autocorrelation and correlation) between each decomposition component. The proposed neural architecture was evaluated through comparative experiments with previous studies using five univariate time-series datasets and four multivariate time-series data. The results showed that long- and short-term predictive performance was improved by more than 10%. The experimental results show that the proposed CRU is an excellent method for TSF problems compared to other neural architectures.
♻ ☆ RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
We introduce RecurrentGemma, a family of open language models which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction tuned variants for both. Our models achieve comparable performance to similarly-sized Gemma baselines despite being trained on fewer tokens.
♻ ☆ A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules
Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules. Inspired by the hypothesis testing formulation of watermark detection, our framework starts by selecting a pivotal statistic of the text and a secret key -- provided by the LLM to the verifier -- to enable controlling the false positive rate (the error of mistakenly detecting human-written text as LLM-generated). Next, this framework allows one to evaluate the power of watermark detection rules by obtaining a closed-form expression of the asymptotic false negative rate (the error of incorrectly classifying LLM-generated text as human-written). Our framework further reduces the problem of determining the optimal detection rule to solving a minimax optimization program. We apply this framework to two representative watermarks -- one of which has been internally implemented at OpenAI -- and obtain several findings that can be instrumental in guiding the practice of implementing watermarks. In particular, we derive optimal detection rules for these watermarks under our framework. These theoretically derived detection rules are demonstrated to be competitive and sometimes enjoy a higher power than existing detection approaches through numerical experiments.
♻ ☆ Guaranteed Coverage Prediction Intervals with Gaussian Process Regression
Gaussian Process Regression (GPR) is a popular regression method, which unlike most Machine Learning techniques, provides estimates of uncertainty for its predictions. These uncertainty estimates however, are based on the assumption that the model is well-specified, an assumption that is violated in most practical applications, since the required knowledge is rarely available. As a result, the produced uncertainty estimates can become very misleading; for example the prediction intervals (PIs) produced for the 95% confidence level may cover much less than 95% of the true labels. To address this issue, this paper introduces an extension of GPR based on a Machine Learning framework called, Conformal Prediction (CP). This extension guarantees the production of PIs with the required coverage even when the model is completely misspecified. The proposed approach combines the advantages of GPR with the valid coverage guarantee of CP, while the performed experimental results demonstrate its superiority over existing methods.
comment: 12 pages. This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2024.3418214
♻ ☆ FRANC: A Lightweight Framework for High-Quality Code Generation SC
In recent years, the use of automated source code generation utilizing transformer-based generative models has expanded, and these models can generate functional code according to the requirements of the developers. However, recent research revealed that these automatically generated source codes can contain vulnerabilities and other quality issues. Despite researchers' and practitioners' attempts to enhance code generation models, retraining and fine-tuning large language models is time-consuming and resource-intensive. Thus, we describe FRANC, a lightweight framework for recommending more secure and high-quality source code derived from transformer-based code generation models. FRANC includes a static filter to make the generated code compilable with heuristics and a quality-aware ranker to sort the code snippets based on a quality score. Moreover, the framework uses prompt engineering to fix persistent quality issues. We evaluated the framework with five Python and Java code generation models and six prompt datasets, including a newly created one in this work (SOEval). The static filter improves 9% to 46% Java suggestions and 10% to 43% Python suggestions regarding compilability. The average improvement over the NDCG@10 score for the ranking system is 0.0763, and the repairing techniques repair the highest 80% of prompts. FRANC takes, on average, 1.98 seconds for Java; for Python, it takes 0.08 seconds.
comment: Accepted at the 24th IEEE International Conference on Source Code Analysis and Manipulation (SCAM 2024)
♻ ☆ The Fault in our Stars: Quality Assessment of Code Generation Benchmarks SC
Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can provide a false sense of performance. In this work, we conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models. To conduct this study, we analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues in them. We also investigated whether fixing the identified quality issues in the benchmarks' prompts affects a model's performance. We also studied memorization issues of the evaluation dataset, which can put into question a benchmark's trustworthiness. We found that code generation evaluation benchmarks mainly focused on Python and coding exercises and had very limited contextual dependencies to challenge the model. These datasets and the developers' prompts suffer from quality issues like spelling and grammatical errors, unclear sentences to express developers' intent, and not using proper documentation style. Fixing all these issues in the benchmarks can lead to a better performance for Python code generation, but not a significant improvement was observed for Java code generation. We also found evidence that GPT-3.5-Turbo and CodeGen-2.5 models may have data contamination issues.
comment: Accepted at the 24th IEEE International Conference on Source Code Analysis and Manipulation(SCAM 2024)
♻ ☆ Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods
Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems using pretrained large language models (LLMs). In this work, we analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity. To this end, we introduce a multi-step latent variable model that encapsulates the reasoning process, where the latent variable encodes the task information. Under this framework, we demonstrate that when the pretraining dataset is sufficiently large, the estimator formed by CoT prompting is equivalent to a Bayesian estimator. This estimator effectively solves the multi-step reasoning problem by aggregating a posterior distribution inferred from the demonstration examples in the prompt. Moreover, we prove that the statistical error of the CoT estimator can be decomposed into two main components: (i) a prompting error, which arises from inferring the true task using CoT prompts, and (ii) the statistical error of the pretrained LLM. We establish that, under appropriate assumptions, the prompting error decays exponentially to zero as the number of demonstrations increases. Additionally, we explicitly characterize the approximation and generalization errors of the pretrained LLM. Notably, we construct a transformer model that approximates the target distribution of the multi-step reasoning problem with an error that decreases exponentially in the number of transformer blocks. Our analysis extends to other variants of CoT, including Self-Consistent CoT, Tree-of-Thought, and Selection-Inference, offering a broad perspective on the efficacy of these methods. We also provide numerical experiments to validate the theoretical findings.
comment: 150 pages, 18 figures, 3 tables
♻ ☆ A Framework to Model ML Engineering Processes
The development of Machine Learning (ML) based systems is complex and requires multidisciplinary teams with diverse skill sets. This may lead to communication issues or misapplication of best practices. Process models can alleviate these challenges by standardizing task orchestration, providing a common language to facilitate communication, and nurturing a collaborative environment. Unfortunately, current process modeling languages are not suitable for describing the development of such systems. In this paper, we introduce a framework for modeling ML-based software development processes, built around a domain-specific language and derived from an analysis of scientific and gray literature. A supporting toolkit is also available.
♻ ☆ Stick to your Role! Stability of Personal Values Expressed in Large Language Models
The standard way to study Large Language Models (LLMs) with benchmarks or psychology questionnaires is to provide many different queries from similar minimal contexts (e.g. multiple choice questions). However, due to LLMs' highly context-dependent nature, conclusions from such minimal-context evaluations may be little informative about the model's behavior in deployment (where it will be exposed to many new contexts). We argue that context-dependence (specifically, value stability) should be studied as a specific property of LLMs and used as another dimension of LLM comparison (alongside others such as cognitive abilities, knowledge, or model size). We present a case-study on the stability of value expression over different contexts (simulated conversations on different topics) as measured using a standard psychology questionnaire (PVQ) and on behavioral downstream tasks. Reusing methods from psychology, we study Rank-order stability on the population (interpersonal) level, and Ipsative stability on the individual (intrapersonal) level. We consider two settings (with and without instructing LLMs to simulate particular personas), two simulated populations, and three downstream tasks. We observe consistent trends in the stability of models and model families - Mixtral, Mistral, GPT-3.5 and Qwen families are more stable than LLaMa-2 and Phi. The consistency of these trends implies that some models exhibit higher value stability than others, and that stability can be estimated with the set of introduced methodological tools. When instructed to simulate particular personas, LLMs exhibit low Rank-order stability, which further diminishes with conversation length. This highlights the need for future research on LLMs that coherently simulate different personas. This paper provides a foundational step in that direction, and, to our knowledge, it is the first study of value stability in LLMs.
comment: The project website and code are available at https://sites.google.com/view/llmvaluestability Published in PLOS ONE ( https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0309114 ), and a shorter version at CogSci 24 ( https://escholarship.org/uc/item/7w4823c6 )
♻ ☆ A Metric-based Principal Curve Approach for Learning One-dimensional Manifold
Principal curve is a well-known statistical method oriented in manifold learning using concepts from differential geometry. In this paper, we propose a novel metric-based principal curve (MPC) method that learns one-dimensional manifold of spatial data. Synthetic datasets Real applications using MNIST dataset show that our method can learn the one-dimensional manifold well in terms of the shape.
♻ ☆ Marked Neural Spatio-Temporal Point Process Involving a Dynamic Graph Neural Network
Temporal Point Processes (TPPs) have recently become increasingly interesting for learning dynamics in graph data. A reason for this is that learning on dynamic graph data is becoming more relevant, since data from many scientific fields, ranging from mathematics, biology, social sciences, and physics to computer science, is naturally related and inherently dynamic. In addition, TPPs provide a meaningful characterization of event streams and a prediction mechanism for future events. Therefore, (semi-)parameterized Neural TPPs have been introduced whose characterization can be (partially) learned and, thus, enable the representation of more complex phenomena. However, the research on modeling dynamic graphs with TPPs is relatively young, and only a few models for node attribute changes or evolving edges have been proposed yet. To allow for learning on fully dynamic graph streams, i.e., graphs that can change in their structure (addition/deletion of nodes/edge) and in their node/edge attributes, we propose a Marked Neural Spatio-Temporal Point Process (MNSTPP). It leverages a Dynamic Graph Neural Network to learn a Marked TPP that handles attributes and spatial data to model and predict any event in a graph stream.
♻ ☆ Analysis of Diagnostics (Part I): Prevalence, Uncertainty Quantification, and Machine Learning
Diagnostic testing provides a unique setting for studying and developing tools in classification theory. In such contexts, the concept of prevalence, i.e. the number of individuals with a given condition, is fundamental, both as an inherent quantity of interest and as a parameter that controls classification accuracy. This manuscript is the first in a two-part series that studies deeper connections between classification theory and prevalence, showing how the latter establishes a more complete theory of uncertainty quantification (UQ) for certain types of machine learning (ML). We motivate this analysis via a lemma demonstrating that general classifiers minimizing a prevalence-weighted error contain the same probabilistic information as Bayes-optimal classifiers, which depend on conditional probability densities. This leads us to study relative probability level-sets $B^\star (q)$, which are reinterpreted as both classification boundaries and useful tools for quantifying uncertainty in class labels. To realize this in practice, we also propose a numerical, homotopy algorithm that estimates the $B^\star (q)$ by minimizing a prevalence-weighted empirical error. The successes and shortcomings of this method motivate us to revisit properties of the level sets, and we deduce the corresponding classifiers obey a useful monotonicity property that stabilizes the numerics and points to important extensions to UQ of ML. Throughout, we validate our methods in the context of synthetic data and a research-use-only SARS-CoV-2 enzyme-linked immunosorbent (ELISA) assay.
♻ ☆ When Multi-Task Learning Meets Partial Supervision: A Computer Vision Review
Multi-Task Learning (MTL) aims to learn multiple tasks simultaneously while exploiting their mutual relationships. By using shared resources to simultaneously calculate multiple outputs, this learning paradigm has the potential to have lower memory requirements and inference times compared to the traditional approach of using separate methods for each task. Previous work in MTL has mainly focused on fully-supervised methods, as task relationships can not only be leveraged to lower the level of data-dependency of those methods but they can also improve performance. However, MTL introduces a set of challenges due to a complex optimisation scheme and a higher labeling requirement. This review focuses on how MTL could be utilised under different partial supervision settings to address these challenges. First, this review analyses how MTL traditionally uses different parameter sharing techniques to transfer knowledge in between tasks. Second, it presents the different challenges arising from such a multi-objective optimisation scheme. Third, it introduces how task groupings can be achieved by analysing task relationships. Fourth, it focuses on how partially supervised methods applied to MTL can tackle the aforementioned challenges. Lastly, this review presents the available datasets, tools and benchmarking results of such methods.
comment: Accepted by Proceedings of the IEEE
♻ ☆ QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning ICSE
Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proof-synthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs' branching structure, enabling reward-free search and overcoming the sparse reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects. QEDCartographer fully automatically proves 21.4% of the test-set theorems. Previous search-based proof-synthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Comparing to the most effective prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27% faster, on average over the theorems both tools prove. Together, QEDCartographer and non-learning-based CoqHammer prove 31.8% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proof-synthesis tools' search mechanisms.
comment: Published in the International Conference on Software Engineering (ICSE) 2025: Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, and Yuriy Brun, QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning, in Proceedings of the 47th International Conference on Software Engineering (ICSE), 2025
♻ ☆ Research on the Spatial Data Intelligent Foundation Model
This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial data intelligent large models and their applications in urban environments, aerospace remote sensing, geography, transportation, and other scenarios. Additionally, it summarizes the latest application cases of spatial data intelligent large models in themes such as urban development, multimodal systems, remote sensing, smart transportation, and resource environments. Finally, the report concludes with an overview and outlook on the development prospects of spatial data intelligent large models.
comment: V1 and V2 are in Chinese language, other versions are in English
♻ ☆ FADE: Towards Fairness-aware Augmentation for Domain Generalization via Classifier-Guided Score-based Diffusion Models
Fairness-aware domain generalization (FairDG) has emerged as a critical challenge for deploying trustworthy AI systems, particularly in scenarios involving distribution shifts. Traditional methods for addressing fairness have failed in domain generalization due to their lack of consideration for distribution shifts. Although disentanglement has been used to tackle FairDG, it is limited by its strong assumptions. To overcome these limitations, we propose Fairness-aware Classifier-Guided Score-based Diffusion Models (FADE) as a novel approach to effectively address the FairDG issue. Specifically, we first pre-train a score-based diffusion model (SDM) and two classifiers to equip the model with strong generalization capabilities across different domains. Then, we guide the SDM using these pre-trained classifiers to effectively eliminate sensitive information from the generated data. Finally, the generated fair data is used to train downstream classifiers, ensuring robust performance under new data distributions. Extensive experiments on three real-world datasets demonstrate that FADE not only enhances fairness but also improves accuracy in the presence of distribution shifts. Additionally, FADE outperforms existing methods in achieving the best accuracy-fairness trade-offs.
♻ ☆ Re-Nerfing: Improving Novel View Synthesis through Novel View Synthesis
Recent neural rendering and reconstruction techniques, such as NeRFs or Gaussian Splatting, have shown remarkable novel view synthesis capabilities but require hundreds of images of the scene from diverse viewpoints to render high-quality novel views. With fewer images available, these methods start to fail since they can no longer correctly triangulate the underlying 3D geometry and converge to a non-optimal solution. These failures can manifest as floaters or blurry renderings in sparsely observed areas of the scene. In this paper, we propose Re-Nerfing, a simple and general add-on approach that leverages novel view synthesis itself to tackle this problem. Using an already trained NVS method, we render novel views between existing ones and augment the training data to optimize a second model. This introduces additional multi-view constraints and allows the second model to converge to a better solution. With Re-Nerfing we achieve significant improvements upon multiple pipelines based on NeRF and Gaussian-Splatting in sparse view settings of the mip-NeRF 360 and LLFF datasets. Notably, Re-Nerfing does not require prior knowledge or extra supervision signals, making it a flexible and practical add-on.
comment: Code will be released upon acceptance
♻ ☆ Contextual Bandit with Herding Effects: Algorithms and Recommendation Applications PRICAI 2024
Contextual bandits serve as a fundamental algorithmic framework for optimizing recommendation decisions online. Though extensive attention has been paid to tailoring contextual bandits for recommendation applications, the "herding effects" in user feedback have been ignored. These herding effects bias user feedback toward historical ratings, breaking down the assumption of unbiased feedback inherent in contextual bandits. This paper develops a novel variant of the contextual bandit that is tailored to address the feedback bias caused by the herding effects. A user feedback model is formulated to capture this feedback bias. We design the TS-Conf (Thompson Sampling under Conformity) algorithm, which employs posterior sampling to balance the exploration and exploitation tradeoff. We prove an upper bound for the regret of the algorithm, revealing the impact of herding effects on learning speed. Extensive experiments on datasets demonstrate that TS-Conf outperforms four benchmark algorithms. Analysis reveals that TS-Conf effectively mitigates the negative impact of herding effects, resulting in faster learning and improved recommendation accuracy.
comment: Published as a conference paper at PRICAI 2024
♻ ☆ Sensitivity-Aware Amortized Bayesian Inference
Sensitivity analyses reveal the influence of various modeling choices on the outcomes of statistical analyses. While theoretically appealing, they are overwhelmingly inefficient for complex Bayesian models. In this work, we propose sensitivity-aware amortized Bayesian inference (SA-ABI), a multifaceted approach to efficiently integrate sensitivity analyses into simulation-based inference with neural networks. First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to data perturbations and preprocessing steps. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model for each choice of likelihood, prior, or data set. Finally, we propose to use deep ensembles to detect sensitivity arising from unreliable approximation (e.g., due to model misspecification). We demonstrate the effectiveness of our method in applied modeling problems, ranging from disease outbreak dynamics and global warming thresholds to human decision-making. Our results support sensitivity-aware inference as a default choice for amortized Bayesian workflows, automatically providing modelers with insights into otherwise hidden dimensions.
comment: Published in TMLR (2024)
♻ ☆ Articulation Work and Tinkering for Fairness in Machine Learning
The field of fair AI aims to counter biased algorithms through computational modelling. However, it faces increasing criticism for perpetuating the use of overly technical and reductionist methods. As a result, novel approaches appear in the field to address more socially-oriented and interdisciplinary (SOI) perspectives on fair AI. In this paper, we take this dynamic as the starting point to study the tension between computer science (CS) and SOI research. By drawing on STS and CSCW theory, we position fair AI research as a matter of 'organizational alignment': what makes research 'doable' is the successful alignment of three levels of work organization (the social world, the laboratory, and the experiment). Based on qualitative interviews with CS researchers, we analyze the tasks, resources, and actors required for doable research in the case of fair AI. We find that CS researchers engage with SOI research to some extent, but organizational conditions, articulation work, and ambiguities of the social world constrain the doability of SOI research for them. Based on our findings, we identify and discuss problems for aligning CS and SOI as fair AI continues to evolve.
♻ ☆ Forecasting Intraday Power Output by a Set of PV Systems using Recurrent Neural Networks and Physical Covariates
Accurate intraday forecasts of the power output by PhotoVoltaic (PV) systems are critical to improve the operation of energy distribution grids. We describe a neural autoregressive model that aims to perform such intraday forecasts. We build upon a physical, deterministic PV performance model, the output of which is used as covariates in the context of the neural model. In addition, our application data relates to a geographically distributed set of PV systems. We address all PV sites with a single neural model, which embeds the information about the PV site in specific covariates. We use a scale-free approach which relies on the explicit modeling of seasonal effects. Our proposal repurposes a model initially used in the retail sector and discloses a novel truncated Gaussian output distribution. An ablation study and a comparison to alternative architectures from the literature shows that the components in the best performing proposed model variant work synergistically to reach a skill score of 15.72% with respect to the physical model, used as a baseline.
comment: 25 pages, 7 figures, Accepted for publication in Neural Computing and Applications on 12/07/2024
♻ ☆ Language-specific Calibration for Pruning Multilingual Language Models
Recent advances in large language model (LLM) pruning have shown state-of-the-art compression results in post-training and retraining-free settings while maintaining high predictive performance. However, such research mainly considers calibrating pruning using English text, despite the multilingual nature of modern LLMs and their frequent uses in non-English languages. In this paper, we set out to explore effective strategies for calibrating the pruning of multilingual language models. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse tasks, models, and state-of-the-art pruning techniques. Our results present practical suggestions, for example, calibrating in the target language can efficiently yield lower perplexity, but does not necessarily benefit downstream tasks. Our further analysis experiments unveil that calibration in the target language mainly contributes to preserving language-specific features related to fluency and coherence, but might not contribute to capturing language-agnostic features such as language understanding and reasoning. Last, we provide practical recommendations for future practitioners.
♻ ☆ Causality-Aware Spatiotemporal Graph Neural Networks for Spatiotemporal Time Series Imputation CIKM'2024
Spatiotemporal time series are usually collected via monitoring sensors placed at different locations, which usually contain missing values due to various failures, such as mechanical damages and Internet outages. Imputing the missing values is crucial for analyzing time series. When recovering a specific data point, most existing methods consider all the information relevant to that point regardless of the cause-and-effect relationship. During data collection, it is inevitable that some unknown confounders are included, e.g., background noise in time series and non-causal shortcut edges in the constructed sensor network. These confounders could open backdoor paths and establish non-causal correlations between the input and output. Over-exploiting these non-causal correlations could cause overfitting. In this paper, we first revisit spatiotemporal time series imputation from a causal perspective and show how to block the confounders via the frontdoor adjustment. Based on the results of frontdoor adjustment, we introduce a novel Causality-Aware Spatiotemporal Graph Neural Network (Casper), which contains a novel Prompt Based Decoder (PBD) and a Spatiotemporal Causal Attention (SCA). PBD could reduce the impact of confounders and SCA could discover the sparse causal relationships among embeddings. Theoretical analysis reveals that SCA discovers causal relationships based on the values of gradients. We evaluate Casper on three real-world datasets, and the experimental results show that Casper could outperform the baselines and could effectively discover causal relationships.
comment: Accepted by CIKM'2024
♻ ☆ Inferring Individual Direct Causal Effects Under Heterogeneous Peer Influence
Causal inference in networks should account for interference, which occurs when a unit's outcome is influenced by treatments or outcomes of peers. Heterogeneous peer influence (HPI) occurs when a unit's outcome is influenced differently by different peers based on their attributes and relationships, or when each unit has a different susceptibility to peer influence. Existing solutions to estimating direct causal effects under interference consider either homogeneous influence from peers or specific heterogeneous influence mechanisms (e.g., based on local neighborhood structure). This paper presents a methodology for estimating individual direct causal effects in the presence of HPI where the mechanism of influence is not known a priori. We propose a structural causal model for networks that can capture different possible assumptions about network structure, interference conditions, and causal dependence and enables reasoning about identifiability in the presence of HPI. We find potential heterogeneous contexts using the causal model and propose a novel graph neural network-based estimator to estimate individual direct causal effects. We show that state-of-the-art methods for individual direct effect estimation produce biased results in the presence of HPI, and that our proposed estimator is robust.
♻ ☆ A Platform-Agnostic Deep Reinforcement Learning Framework for Effective Sim2Real Transfer towards Autonomous Driving
Deep Reinforcement Learning (DRL) has shown remarkable success in solving complex tasks across various research fields. However, transferring DRL agents to the real world is still challenging due to the significant discrepancies between simulation and reality. To address this issue, we propose a robust DRL framework that leverages platform-dependent perception modules to extract task-relevant information and train a lane-following and overtaking agent in simulation. This framework facilitates the seamless transfer of the DRL agent to new simulated environments and the real world with minimal effort. We evaluate the performance of the agent in various driving scenarios in both simulation and the real world, and compare it to human players and the PID baseline in simulation. Our proposed framework significantly reduces the gaps between different platforms and the Sim2Real gap, enabling the trained agent to achieve similar performance in both simulation and the real world, driving the vehicle effectively.
♻ ☆ Improving the forecast accuracy of wind power by leveraging multiple hierarchical structure
Renewable energy generation is of utmost importance for global decarbonization. Forecasting renewable energies, particularly wind energy, is challenging due to the inherent uncertainty in wind energy generation, which depends on weather conditions. Recent advances in hierarchical forecasting through reconciliation have demonstrated a significant increase in the quality of wind energy forecasts for short-term periods. We leverage the cross-sectional and temporal hierarchical structure of turbines in wind farms and build cross-temporal hierarchies to further investigate how integrated cross-sectional and temporal dimensions can add value to forecast accuracy in wind farms. We found that cross-temporal reconciliation was superior to individual cross-sectional reconciliation at multiple temporal aggregations. Additionally, machine learning based forecasts that were cross-temporally reconciled demonstrated high accuracy at coarser temporal granularities, which may encourage adoption for short-term wind forecasts. Empirically, we provide insights for decision-makers on the best methods for forecasting high-frequency wind data across different forecasting horizons and levels.
comment: 41 pages, 14 figures
♻ ☆ DeepMIF: Deep Monotonic Implicit Fields for Large-Scale LiDAR 3D Mapping
Recently, significant progress has been achieved in sensing real large-scale outdoor 3D environments, particularly by using modern acquisition equipment such as LiDAR sensors. Unfortunately, they are fundamentally limited in their ability to produce dense, complete 3D scenes. To address this issue, recent learning-based methods integrate neural implicit representations and optimizable feature grids to approximate surfaces of 3D scenes. However, naively fitting samples along raw LiDAR rays leads to noisy 3D mapping results due to the nature of sparse, conflicting LiDAR measurements. Instead, in this work we depart from fitting LiDAR data exactly, instead letting the network optimize a non-metric monotonic implicit field defined in 3D space. To fit our field, we design a learning system integrating a monotonicity loss that enables optimizing neural monotonic fields and leverages recent progress in large-scale 3D mapping. Our algorithm achieves high-quality dense 3D mapping performance as captured by multiple quantitative and perceptual measures and visual results obtained for Mai City, Newer College, and KITTI benchmarks. The code of our approach will be made publicly available.
comment: 8 pages, 6 figures
♻ ☆ FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction
Researchers have proposed to use data of human preference feedback to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. Therefore, we develop and test a method to automatically score user preferences from their spontaneous facial expression reaction to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. We develop an FAU-Net (Facial Action Units Neural Network), which receives inputs from an AU estimation model, to automatically score user preferences for text-to-image generation based on their facial expression reactions, which is complementary to the pre-trained scoring models based on the input text prompts and generated images. Integrating our FAU-Net valence score with the pre-trained scoring models improves their consistency with human preferences. This method of automatic annotation with facial expression analysis can be potentially generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is also available at the same link for research purposes.
♻ ☆ Trade-off between Gradient Measurement Efficiency and Expressivity in Deep Quantum Neural Networks
Quantum neural networks (QNNs) require an efficient training algorithm to achieve practical quantum advantages. A promising approach is the use of gradient-based optimization algorithms, where gradients are estimated through quantum measurements. However, general QNNs lack an efficient gradient measurement algorithm, which poses a fundamental and practical challenge to realizing scalable QNNs. In this work, we rigorously prove a trade-off between gradient measurement efficiency, defined as the mean number of simultaneously measurable gradient components, and expressivity in a wide class of deep QNNs, elucidating the theoretical limits and possibilities of efficient gradient estimation. This trade-off implies that a more expressive QNN requires a higher measurement cost in gradient estimation, whereas we can increase gradient measurement efficiency by reducing the QNN expressivity to suit a given task. We further propose a general QNN ansatz called the stabilizer-logical product ansatz (SLPA), which can reach the upper limit of the trade-off inequality by leveraging the symmetric structure of the quantum circuit. In learning an unknown symmetric function, the SLPA drastically reduces the quantum resources required for training while maintaining accuracy and trainability compared to a well-designed symmetric circuit based on the parameter-shift method. Our results not only reveal a theoretical understanding of efficient training in QNNs but also provide a standard and broadly applicable efficient QNN design.
comment: 31 pages, 11 figures
♻ ☆ Domain-decoupled Physics-informed Neural Networks with Closed-form Gradients for Fast Model Learning of Dynamical Systems
Physics-informed neural networks (PINNs) are trained using physical equations and can also incorporate unmodeled effects by learning from data. PINNs for control (PINCs) of dynamical systems are gaining interest due to their prediction speed compared to classical numerical integration methods for nonlinear state-space models, making them suitable for real-time control applications. We introduce the domain-decoupled physics-informed neural network (DD-PINN) to address current limitations of PINC in handling large and complex nonlinear dynamical systems. The time domain is decoupled from the feed-forward neural network to construct an Ansatz function, allowing for calculation of gradients in closed form. This approach significantly reduces training times, especially for large dynamical systems, compared to PINC, which relies on graph-based automatic differentiation. Additionally, the DD-PINN inherently fulfills the initial condition and supports higher-order excitation inputs, simplifying the training process and enabling improved prediction accuracy. Validation on three systems - a nonlinear mass-spring-damper, a five-mass-chain, and a two-link robot - demonstrates that the DD-PINN achieves significantly shorter training times. In cases where the PINC's prediction diverges, the DD-PINN's prediction remains stable and accurate due to higher physics loss reduction or use of a higher-order excitation input. The DD-PINN allows for fast and accurate learning of large dynamical systems previously out of reach for the PINC.
comment: Accepted to International Conference on Informatics in Control, Automation and Robotics (ICINCO) 2024
♻ ☆ Solid Waste Detection, Monitoring and Mapping in Remote Sensing Images: A Survey
The detection and characterization of illegal solid waste disposal sites are essential for environmental protection, particularly for mitigating pollution and health hazards. Improperly managed landfills contaminate soil and groundwater via rainwater infiltration, posing threats to both animals and humans. Traditional landfill identification approaches, such as on-site inspections, are time-consuming and expensive. Remote sensing is a cost-effective solution for the identification and monitoring of solid waste disposal sites that enables broad coverage and repeated acquisitions over time. Earth Observation (EO) satellites, equipped with an array of sensors and imaging capabilities, have been providing high-resolution data for several decades. Researchers proposed specialized techniques that leverage remote sensing imagery to perform a range of tasks such as waste site detection, dumping site monitoring, and assessment of suitable locations for new landfills. This review aims to provide a detailed illustration of the most relevant proposals for the detection and monitoring of solid waste sites by describing and comparing the approaches, the implemented techniques, and the employed data. Furthermore, since the data sources are of the utmost importance for developing an effective solid waste detection model, a comprehensive overview of the satellites and publicly available data sets is presented. Finally, this paper identifies the open issues in the state-of-the-art and discusses the relevant research directions for reducing the costs and improving the effectiveness of novel solid waste detection methods.
♻ ☆ Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case-study, which introduces variability in (static) group formation via subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.
comment: 11 pages, 9 figures
♻ ☆ Symplectic Bregman divergences
We present a generalization of Bregman divergences in symplectic vector spaces that we term symplectic Bregman divergences. Symplectic Bregman divergences are derived from a symplectic generalization of the Fenchel-Young inequality which relies on the notion of symplectic subdifferentials. The symplectic Fenchel-Young inequality is obtained using the symplectic Fenchel transform which is defined with respect to the symplectic form. Since symplectic forms can be generically built from pairings of dual systems, we get a generalization of Bregman divergences in dual systems obtained by equivalent symplectic Bregman divergences. In particular, when the symplectic form is derived from an inner product, we show that the corresponding symplectic Bregman divergences amount to ordinary Bregman divergences with respect to composite inner products. Some potential applications of symplectic divergences in geometric mechanics, information geometry, and learning dynamics in machine learning are touched upon.
comment: 14 pages, 3 figures
♻ ☆ AnomalyLLM: Few-shot Anomaly Edge Detection for Dynamic Graphs using Large Language Models
Detecting anomaly edges for dynamic graphs aims to identify edges significantly deviating from the normal pattern and can be applied in various domains, such as cybersecurity, financial transactions and AIOps. With the evolving of time, the types of anomaly edges are emerging and the labeled anomaly samples are few for each type. Current methods are either designed to detect randomly inserted edges or require sufficient labeled data for model training, which harms their applicability for real-world applications. In this paper, we study this problem by cooperating with the rich knowledge encoded in large language models(LLMs) and propose a method, namely AnomalyLLM. To align the dynamic graph with LLMs, AnomalyLLM pre-trains a dynamic-aware encoder to generate the representations of edges and reprograms the edges using the prototypes of word embeddings. Along with the encoder, we design an in-context learning framework that integrates the information of a few labeled samples to achieve few-shot anomaly detection. Experiments on four datasets reveal that AnomalyLLM can not only significantly improve the performance of few-shot anomaly detection, but also achieve superior results on new anomalies without any update of model parameters.
comment: 13pages
♻ ☆ ViIK: Flow-based Vision Inverse Kinematics Solver with Fusing Collision Checking
Inverse Kinematics (IK) is to find the robot's configurations that satisfy the target pose of the end effector. In motion planning, diverse configurations were required in case a feasible trajectory was not found. Meanwhile, collision checking (CC), e.g. Oriented bounding box (OBB), Discrete Oriented Polytope (DOP), and Quickhull \cite{quickhull}, needs to be done for each configuration provided by the IK solver to ensure every goal configuration for motion planning is available. This means the classical IK solver and CC algorithm should be executed repeatedly for every configuration. Thus, the preparation time is long when the required number of goal configurations is large, e.g. motion planning in cluster environments. Moreover, structured maps, which might be difficult to obtain, were required by classical collision-checking algorithms. To sidestep such two issues, we propose a flow-based vision method that can output diverse available configurations by fusing inverse kinematics and collision checking, named Vision Inverse Kinematics solver (ViIK). Moreover, ViIK uses RGB images as the perception of environments. ViIK can output 1000 configurations within 40 ms, and the accuracy is about 3 millimeters and 1.5 degrees. The higher accuracy can be obtained by being refined by the classical IK solver within a few iterations. The self-collision rates can be lower than 2%. The collision-with-env rates can be lower than 10% in most scenes. The code is available at: https://github.com/AdamQLMeng/ViIK.
♻ ☆ Physics-Informed Neural Network for Concrete Manufacturing Process Optimization
Concrete manufacturing projects are one of the most common ones for consulting agencies. Because of the highly non-linear dependency of input materials like ash, water, cement, superplastic, etc; with the resultant strength of concrete, it gets difficult for machine learning models to successfully capture this relation and perform cost optimizations. This paper highlights how PINNs (Physics Informed Neural Networks) can be useful in the given situation. This state-of-the-art model shall also get compared with traditional models like Linear Regression, Random Forest, Gradient Boosting, and Deep Neural Network. Results of the research highlights how well PINNs performed even with reduced dataset, thus resolving one of the biggest issues of limited data availability for ML models. On an average, PINN got the loss value reduced by 26.3% even with 40% lesser data compared to the Deep Neural Network. In addition to predicting strength of the concrete given the quantity of raw materials, the paper also highlights the use of heuristic optimization method like Particle Swarm Optimization (PSO) in predicting quantity of raw materials required to manufacture concrete of given strength with least cost.
♻ ☆ Procedural Adherence and Interpretability Through Neuro-Symbolic Generative Agents
The surge in popularity of large language models (LLMs) has opened doors for new approaches to the creation of interactive agents. However, managing and interpreting the temporal behavior of such agents over the course of a potentially infinite interaction remain challenging. The stateful, long-term horizon reasoning required for coherent agent behavior does not fit well into the LLM paradigm. We propose a combination of formal logic-based program synthesis and LLM content generation to bring guarantees of procedural adherence and interpretability to generative agent behavior. To illustrate the benefit of procedural adherence and interpretability, we use Temporal Stream Logic (TSL) to generate an automaton that enforces an interpretable, high-level temporal structure on an agent. With the automaton tracking the context of the interaction and making decisions to guide the conversation accordingly, we can drive content generation in a way that allows the LLM to focus on a shorter context window. We evaluated our approach on different tasks involved in creating an interactive agent specialized for generating choose-your-own-adventure games. We found that over all of the tasks, an automaton-enhanced agent with procedural guarantees achieves at least 96% adherence to its temporal constraints, whereas a purely LLM-based agent demonstrates as low as 14.67% adherence.
comment: 11 pages
♻ ☆ Wireless Channel Aware Data Augmentation Methods for Deep Learning-Based Indoor Localization
Indoor localization is a challenging problem that - unlike outdoor localization - lacks a universal and robust solution. Machine Learning (ML), particularly Deep Learning (DL), methods have been investigated as a promising approach. Although such methods bring remarkable localization accuracy, they heavily depend on the training data collected from the environment. The data collection is usually a laborious and time-consuming task, but Data Augmentation (DA) can be used to alleviate this issue. In this paper, different from previously used DA, we propose methods that utilize the domain knowledge about wireless propagation channels and devices. The methods exploit the typical hardware component drift in the transceivers and/or the statistical behavior of the channel, in combination with the measured Power Delay Profile (PDP). We comprehensively evaluate the proposed methods to demonstrate their effectiveness. This investigation mainly focuses on the impact of factors such as the number of measurements, augmentation proportion, and the environment of interest impact the effectiveness of the different DA methods. We show that in the low-data regime (few actual measurements available), localization accuracy increases up to 50%, matching non-augmented results in the high-data regime. In addition, the proposed methods may outperform the measurement-only high-data performance by up to 33% using only 1/4 of the amount of measured data. We also exhibit the effect of different training data distribution and quality on the effectiveness of DA. Finally, we demonstrate the power of the proposed methods when employed along with Transfer Learning (TL) to address the data scarcity in target and/or source environments.
comment: 13 pages, 14 figures
♻ ☆ Lipschitz-regularized gradient flows and generative particle algorithms for high-dimensional scarce data
We build a new class of generative algorithms capable of efficiently learning an arbitrary target distribution from possibly scarce, high-dimensional data and subsequently generate new samples. These generative algorithms are particle-based and are constructed as gradient flows of Lipschitz-regularized Kullback-Leibler or other $f$-divergences, where data from a source distribution can be stably transported as particles, towards the vicinity of the target distribution. As a highlighted result in data integration, we demonstrate that the proposed algorithms correctly transport gene expression data points with dimension exceeding 54K, while the sample size is typically only in the hundreds.
♻ ☆ Decentralized Online Learning for Random Inverse Problems Over Graphs
We propose a decentralized online learning algorithm for distributed random inverse problems over network graphs with online measurements, and unifies the distributed parameter estimation in Hilbert spaces and the least mean square problem in reproducing kernel Hilbert spaces (RKHS-LMS). We transform the convergence of the algorithm into the asymptotic stability of a class of inhomogeneous random difference equations in Hilbert spaces with $L_{2}$-bounded martingale difference terms and develop the $L_2$-asymptotic stability theory in Hilbert spaces. We show that if the network graph is connected and the sequence of forward operators satisfies the infinite-dimensional spatio-temporal persistence of excitation condition, then the estimates of all nodes are mean square and almost surely strongly consistent. Moreover, we propose a decentralized online learning algorithm in RKHS based on non-stationary online data streams, and prove that the algorithm is mean square and almost surely strongly consistent if the operators induced by the random input data satisfy the infinite-dimensional spatio-temporal persistence of excitation condition.
♻ ☆ Quantum-machine-assisted Drug Discovery: Survey and Perspective
Drug discovery and development is a highly complex and costly endeavor, typically requiring over a decade and substantial financial investment to bring a new drug to market. Traditional computer-aided drug design (CADD) has made significant progress in accelerating this process, but the development of quantum computing offers potential due to its unique capabilities. This paper discusses the integration of quantum computing into drug discovery and development, focusing on how quantum technologies might accelerate and enhance various stages of the drug development cycle. Specifically, we explore the application of quantum computing in addressing challenges related to drug discovery, such as molecular simulation and the prediction of drug-target interactions, as well as the optimization of clinical trial outcomes. By leveraging the inherent capabilities of quantum computing, we might be able to reduce the time and cost associated with bringing new drugs to market, ultimately benefiting public health.
comment: 27 pages, 10 figures
♻ ☆ Heat Death of Generative Models in Closed-Loop Learning
Improvement and adoption of generative machine learning models is rapidly accelerating, as exemplified by the popularity of LLMs (Large Language Models) for text, and diffusion models for image generation. As generative models become widespread, data they generate is incorporated into shared content through the public web. This opens the question of what happens when data generated by a model is fed back to the model in subsequent training campaigns. This is a question about the stability of the training process, whether the distribution of publicly accessible content, which we refer to as "knowledge", remains stable or collapses. Small scale empirical experiments reported in the literature show that this closed-loop training process is prone to degenerating. Models may start producing gibberish data, or sample from only a small subset of the desired data distribution (a phenomenon referred to as mode collapse). So far there has been only limited theoretical understanding of this process, in part due to the complexity of the deep networks underlying these generative models. The aim of this paper is to provide insights into this process (that we refer to as "generative closed-loop learning") by studying the learning dynamics of generative models that are fed back their own produced content in addition to their original training dataset. The sampling of many of these models can be controlled via a "temperature" parameter. Using dynamical systems tools, we show that, unless a sufficient amount of external data is introduced at each iteration, any non-trivial temperature leads the model to asymptotically degenerate. In fact, either the generative distribution collapses to a small set of outputs or becomes uniform over a large set of outputs.
♻ ☆ Graphical vs. Deep Generative Models: Measuring the Impact of Differentially Private Mechanisms and Budgets on Utility CCS 2024
Generative models trained with Differential Privacy (DP) can produce synthetic data while reducing privacy risks. However, navigating their privacy-utility tradeoffs makes finding the best models for specific settings/tasks challenging. This paper bridges this gap by profiling how DP generative models for tabular data distribute privacy budgets across rows and columns, which is one of the primary sources of utility degradation. We compare graphical and deep generative models, focusing on the key factors contributing to how privacy budgets are spent, i.e., underlying modeling techniques, DP mechanisms, and data dimensionality. Through our measurement study, we shed light on the characteristics that make different models suitable for various settings and tasks. For instance, we find that graphical models distribute privacy budgets horizontally and thus cannot handle relatively wide datasets for a fixed training time; also, the performance on the task they were optimized for monotonically increases with more data but could also overfit. Deep generative models spend their budgets per iteration, so their behavior is less predictable with varying dataset dimensions, but are more flexible as they could perform better if trained on more features. Moreover, low levels of privacy ($\epsilon\geq100$) could help some models generalize, achieving better results than without applying DP. We believe our work will aid the deployment of DP synthetic data techniques by navigating through the best candidate models vis-a-vis the dataset features, desired privacy levels, and downstream tasks.
comment: A shorter version of this paper appears in the Proceedings of the 31st ACM Conference on Computer and Communications Security (ACM CCS 2024). This is the full version
Multimedia
☆ Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively processing long videos. In this paper, we introduce Kangaroo, a powerful Video LMM aimed at addressing these challenges. Confronted with issue of inadequate training data, we develop a data curation system to build a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning. In addition, we design a curriculum training pipeline with gradually increasing resolution and number of input frames to accommodate long videos. Evaluation results demonstrate that, with 8B parameters, Kangaroo achieves state-of-the-art performance across a variety of video understanding benchmarks while exhibiting competitive results on others. Particularly, on benchmarks specialized for long videos, Kangaroo excels some larger models with over 10B parameters and proprietary models.
☆ A Simple Baseline with Single-encoder for Referring Image Segmentation
Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, commonly adapted dual-encoders in RIS, e.g., Swin transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with a pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that interact two encoders, but this approach leads to high computational costs. In this paper, we present a novel RIS method with a single-encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interactions of two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performances on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.
comment: ArXiv pre-print
☆ Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
Text-to-image generation models have achieved remarkable advancements in recent years, aiming to produce realistic images from textual descriptions. However, these models often struggle with generating anatomically accurate representations of human hands. The resulting images frequently exhibit issues such as incorrect numbers of fingers, unnatural twisting or interlacing of fingers, or blurred and indistinct hands. These issues stem from the inherent complexity of hand structures and the difficulty in aligning textual descriptions with precise visual depictions of hands. To address these challenges, we propose a novel approach named Hand1000 that enables the generation of realistic hand images with target gesture using only 1,000 training samples. The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation. The second stage further optimizes text embedding by incorporating the extracted hand gesture representation, to improve alignment between the textual descriptions and the generated hand images. The third stage utilizes the optimized embedding to fine-tune the Stable Diffusion model to generate realistic hand images. In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation. Based on the existing hand gesture recognition dataset, we adopt advanced image captioning models and LLaMA3 to generate high-quality textual descriptions enriched with detailed gesture information. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors.
comment: Project page https://haozhuo-zhang.github.io/Hand1000-project-page/
☆ SVDD 2024: The Inaugural Singing Voice Deepfake Detection Challenge
With the advancements in singing voice generation and the growing presence of AI singers on media platforms, the inaugural Singing Voice Deepfake Detection (SVDD) Challenge aims to advance research in identifying AI-generated singing voices from authentic singers. This challenge features two tracks: a controlled setting track (CtrSVDD) and an in-the-wild scenario track (WildSVDD). The CtrSVDD track utilizes publicly available singing vocal data to generate deepfakes using state-of-the-art singing voice synthesis and conversion systems. Meanwhile, the WildSVDD track expands upon the existing SingFake dataset, which includes data sourced from popular user-generated content websites. For the CtrSVDD track, we received submissions from 47 teams, with 37 surpassing our baselines and the top team achieving a 1.65% equal error rate. For the WildSVDD track, we benchmarked the baselines. This paper reviews these results, discusses key findings, and outlines future directions for SVDD research.
♻ ☆ Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.
comment: Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)
♻ ☆ MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion ACM MM 2024
Co-speech gesture generation is crucial for producing synchronized and realistic human gestures that accompany speech, enhancing the animation of lifelike avatars in virtual environments. While diffusion models have shown impressive capabilities, current approaches often overlook a wide range of modalities and their interactions, resulting in less dynamic and contextually varied gestures. To address these challenges, we present MambaGesture, a novel framework integrating a Mamba-based attention block, MambaAttn, with a multi-modality feature fusion module, SEAD. The MambaAttn block combines the sequential data processing strengths of the Mamba model with the contextual richness of attention mechanisms, enhancing the temporal coherence of generated gestures. SEAD adeptly fuses audio, text, style, and emotion modalities, employing disentanglement to deepen the fusion process and yield gestures with greater realism and diversity. Our approach, rigorously evaluated on the multi-modal BEAT dataset, demonstrates significant improvements in Fr\'echet Gesture Distance (FGD), diversity scores, and beat alignment, achieving state-of-the-art performance in co-speech gesture generation. Project website: $\href{https://fcchit.github.io/mambagesture/}{\textit{https://fcchit.github.io/mambagesture/}}$.
comment: Accepted by ACM MM 2024
♻ ☆ AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results
Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dataset of 459 videos, encoded with 14 codecs of various compression standards (AVC/H.264, HEVC/H.265, AV1, and VVC/H.266) and containing a comprehensive collection of compression artifacts. To measure the methods performance, we employed traditional correlation coefficients between their predictions and subjective scores, which were collected via large-scale crowdsourced pairwise human comparisons. For training purposes, participants were provided with the Compressed Video Quality Assessment Dataset (CVQAD), a previously developed dataset of 1022 videos. Up to 30 participating teams registered for the challenge, while we report the results of 6 teams, which submitted valid final solutions and code for reproducing the results. Moreover, we calculated and present the performance of state-of-the-art VQA methods on the developed dataset, providing a comprehensive benchmark for future research. The dataset, results, and online leaderboard are publicly available at https://challenges.videoprocessing.ai/challenges/compressedvideo-quality-assessment.html.